Impala (2): Architecture

Tags: Impala


Components of the Impala Server

The Impala server is a distributed, massively parallel processing (MPP) database engine. It consists of different daemon processes that run on specific hosts within your cluster.

Impala Daemon

The core Impala component is a daemon that runs on each DataNode of the cluster, physically represented by the impalad process. It reads and writes data files; accepts queries transmitted from impala-shell, Hue, JDBC, or ODBC; parallelizes the queries and distributes work across the cluster; and transmits intermediate query results back to the central coordinator node.
You can submit a query to the Impala daemon running on any DataNode, and that instance of the daemon serves as the coordinator node for the query. The other nodes transmit partial results back to the coordinator, which constructs the final result set for the query. When running experiments through the impala-shell command, you might always connect to the same Impala daemon for convenience. For clusters running production workloads, you might load-balance by submitting each query to a different Impala daemon in round-robin style, using the JDBC or ODBC interfaces.
The Impala daemons are in constant communication with the statestore, to confirm which nodes are healthy and can accept new work.
They also receive broadcast messages from the catalogd daemon (introduced in Impala 1.2) whenever any Impala node in the cluster creates, alters, or drops any type of object, or processes an INSERT or LOAD DATA statement through Impala. This background communication minimizes the need for the REFRESH or INVALIDATE METADATA statements that were required to coordinate metadata across nodes prior to Impala 1.2.
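
As a minimal sketch of what this enables in Impala 1.2 and later (the table, columns, and host names below are illustrative assumptions, not from the original text): a table created through one Impala daemon is immediately visible to queries coordinated by any other daemon, with no manual REFRESH or INVALIDATE METADATA.

    -- Connected to one Impala daemon (for example, impala-shell -i host1):
    CREATE TABLE web_logs (ts TIMESTAMP, url STRING, status INT) STORED AS PARQUET;
    INSERT INTO web_logs VALUES (now(), '/index.html', 200);

    -- Connected to a different Impala daemon (for example, impala-shell -i host2);
    -- no REFRESH or INVALIDATE METADATA is needed, because catalogd has already
    -- broadcast the metadata change:
    SELECT count(*) FROM web_logs;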

Impala Statestore

The Impala component known as the statestore checks on the health of Impala daemons on all the DataNodes in a cluster, and continuously relays its findings to each of those daemons. It is physically represented by a daemon process named statestored; you only need such a process on one host in a cluster. If an Impala daemon goes offline due to hardware failure, network error, software issue, or other reason, the statestore informs all the other Impala daemons so that future queries can avoid making requests to the unreachable node.

Because the statestore's purpose is to provide help when things go wrong, it is not critical to the normal operation of an Impala cluster. If the statestore is not running or becomes unreachable, the Impala daemons continue running and distributing work among themselves as usual; the cluster just becomes less robust, because failures of other Impala daemons cannot be detected while the statestore is offline. When the statestore comes back online, it re-establishes communication with the Impala daemons and resumes its monitoring function.

Most considerations for load balancing and high availability apply to the impalad daemon. The statestored and catalogd daemons have no special requirements for high availability, because problems with them do not result in data loss. If those daemons become unavailable because a particular host goes down, you can stop the Impala service, delete the Impala StateStore and Impala Catalog Server roles, add the roles on a different host, and restart the Impala service.

Impala Catalog Service

The Impala component known as the catalog service relays metadata changes from Impala SQL statements to all the DataNodes in a cluster. It is physically represented by a daemon process named catalogd; you only need such a process on one host in the cluster. Because the requests are passed through the statestore daemon, it makes sense to run the statestored and catalogd services on the same host.

The catalog service avoids the need to issue REFRESH and INVALIDATE METADATA statements when the metadata changes are made by statements issued through Impala. When you create a table, load data, and so on through Hive, you still need to issue REFRESH or INVALIDATE METADATA on an Impala node before executing a query there.

This feature touches a number of aspects of Impala:

  • See "Installing Impala" on page 24, "Upgrading Impala" and "Start Impala" on page 31 on page 30, to see the use of information catalogd daemon.
  • The REFRESH and INVALIDATE METADATA statements are not needed when the CREATE TABLE, INSERT, or other table-changing or data-changing operation is performed through Impala.
    These statements are still needed if such operations are done through Hive or by manipulating data files directly in HDFS, but in those cases the statements only need to be issued on one Impala node rather than on all nodes. See the REFRESH statement and the INVALIDATE METADATA statement for the latest usage information.

By default, the metadata loading and caching on startup happens asynchronously, so Impala can begin accepting requests promptly. To enable the original behavior, where Impala waited until all metadata was loaded before accepting any requests, set the catalogd configuration option --load_catalog_in_background=false.

Most considerations for load balancing and high availability apply to the impalad daemon. The statestored and catalogd daemons do not have special requirements for high availability, because problems with those daemons do not result in data loss. If those daemons become unavailable due to an outage on a particular host, you can stop the Impala service, delete the Impala StateStore and Impala Catalog Server roles, add the roles on a different host, and restart the Impala service.

Developing Impala Applications

The core development language for Impala is SQL. You can also use Java or other languages to interact with Impala through the standard JDBC and ODBC interfaces used by many business intelligence tools. For specialized kinds of analysis, you can supplement the built-in SQL functions with user-defined functions (UDFs) written in C++ or Java.
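
As a hedged sketch of how such a UDF might be hooked in (the function name, library path, symbol, and table below are hypothetical, not part of the original text), registering and using a C++ UDF from Impala looks roughly like this:

    -- Register a scalar UDF compiled into a shared library stored in HDFS
    -- (library path and symbol name are illustrative).
    CREATE FUNCTION fuzzy_match(STRING, STRING) RETURNS BOOLEAN
      LOCATION '/user/impala/udfs/libfuzzy.so'
      SYMBOL = 'FuzzyMatch';

    -- Use it like any built-in function.
    SELECT customer_name
    FROM customers
    WHERE fuzzy_match(customer_name, 'Smith');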

Impala SQL Dialect Overview

The Impala SQL dialect is highly compatible with the SQL syntax used in the Apache Hive component (HiveQL).
As such, it is familiar to users who already run SQL queries on the Hadoop infrastructure. Currently, Impala SQL supports a subset of HiveQL statements, data types, and built-in functions. Impala also includes additional built-in functions for common industry features, to simplify porting SQL from non-Hadoop systems.

For users coming to Impala from a traditional database or data warehousing background, the following aspects of the SQL dialect will seem familiar:

  • The SELECT statement includes familiar clauses such as WHERE, GROUP BY, ORDER BY, and WITH. You
    will find familiar notions such as joins, built-in functions for processing strings, numbers, and dates, aggregate functions, subqueries, and comparison operators such as IN() and BETWEEN. The SELECT statement is the place where SQL standards compliance matters most.
  • From the data warehousing world, you will recognize the notion of partitioned tables. One or more columns
    serve as the partition keys, and the data is physically arranged so that queries that refer to the partition key columns in the WHERE clause can skip partitions that do not match the filter conditions. For example, if you have 10 years of data and use a clause such as WHERE year = 2015, WHERE year > 2010, or WHERE year IN (2014, 2015), Impala skips all the data for the non-matching years, greatly reducing the amount of I/O for the query. (See the partition-pruning sketch after this list.)
  • In Impala 1.2 and later, UDFs let you perform custom comparisons and transformation logic during SELECT and INSERT ... SELECT statements.
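
The following is a minimal sketch of the partition-pruning idea described above (the table, columns, and partition values are illustrative assumptions, not from the original text):

    -- A table partitioned by year; each year's data lives in its own HDFS subdirectory.
    CREATE TABLE sales (
      sale_id   BIGINT,
      amount    DOUBLE,
      sale_date TIMESTAMP
    )
    PARTITIONED BY (year INT)
    STORED AS PARQUET;

    -- Only the partitions for 2014 and 2015 are scanned; all other years are skipped.
    SELECT count(*), sum(amount)
    FROM sales
    WHERE year IN (2014, 2015);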

For users coming to Impala from a traditional database or data warehousing background, the following aspects of the SQL dialect might require some learning and practice before you are proficient in the Hadoop environment:

  • Impala SQL is focused on queries and includes relatively little DML. There is no UPDATE or DELETE statement. Stale data is typically discarded (by DROP TABLE or ALTER TABLE ... DROP PARTITION statements) or replaced (by INSERT OVERWRITE statements).
  • All data creation is done by INSERT statements, which typically insert data in bulk by querying other tables. There are two variations: INSERT INTO, which appends to the existing data, and INSERT OVERWRITE, which replaces the entire contents of a table or partition (similar to TRUNCATE TABLE followed by a new INSERT). Although there is an INSERT ... VALUES syntax to create a small number of values in a single statement, it is far more efficient to use INSERT ... SELECT to copy and transform large amounts of data from one table to another in a single operation. (See the sketch after this list.)
  • You often construct Impala table definitions and data files in other environments, and then attach Impala
    so that it can run real-time queries. The same data files and table metadata are shared with other components of the Hadoop ecosystem. In particular, Impala can access tables created by Hive or data inserted by Hive, and Hive can access tables and data produced by Impala. Many other Hadoop components can write files in formats such as Parquet and Avro, which can then be queried by Impala.
  • Because Hadoop and Impala are focused on data warehouse-style operations on large data sets, Impala SQL includes some idioms that you might find in the import utilities for traditional database systems. For example, you can create a table that reads comma-separated or tab-separated text files, specifying the separator in the CREATE TABLE statement. You can also create external tables that read existing data files without converting or moving them.
  • Because Impala reads large quantities of data that might not be perfectly tidy and predictable, it does not require length constraints on string data types. For example, you can define a database column as STRING with unlimited length, rather than CHAR(1) or VARCHAR(64). (Although in Impala 2.0 and later, you can also use length-constrained VARCHAR and CHAR types.)
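
The following sketch ties several of these idioms together (table names, paths, and columns are illustrative assumptions, not from the original text):

    -- An external table over existing tab-separated text files already in HDFS;
    -- the files are read in place, without being converted or moved.
    CREATE EXTERNAL TABLE raw_events (
      event_time STRING,
      user_id    BIGINT,
      detail     STRING            -- unconstrained STRING rather than VARCHAR(n)
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw/events';

    CREATE TABLE events (event_time STRING, user_id BIGINT, detail STRING) STORED AS PARQUET;

    -- INSERT OVERWRITE replaces the table's previous contents;
    -- INSERT INTO would append to them instead.
    INSERT OVERWRITE TABLE events
    SELECT event_time, user_id, detail
    FROM raw_events
    WHERE user_id IS NOT NULL;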

Impala Programming Interface Overview

You can connect to Impala daemons and submit requests in the following ways:

  • The impala-shell interactive command interpreter.
  • The Hue web-based user interface.
  • JDBC.
  • ODBC.
    With these options, you can use Impala in heterogeneous environments, with JDBC or ODBC applications running on non-Linux platforms. You can also combine Impala with a variety of business intelligence tools that use the JDBC or ODBC interfaces. Each impalad daemon, running on a separate node in the cluster, listens on several ports for incoming requests. Requests from impala-shell and Hue are routed to the impalad daemons through the same port, while the impalad daemons listen on separate ports for JDBC and ODBC requests.

How Impala Fits Into the Hadoop Ecosystem

Impala makes use of many familiar components within the Hadoop ecosystem. Impala can interchange data with other Hadoop components, as both a consumer and a producer, so it can fit flexibly into your ETL and ELT pipelines.

How Impala Works with Hive

A major Impala goal is to make SQL-on-Hadoop operations fast and efficient enough to appeal to new categories of users and to open up Hadoop to new types of use cases. Where practical, it makes use of the existing Apache Hive infrastructure that many Hadoop users already have in place to perform long-running, batch-oriented SQL queries.
In particular, Impala keeps its table definitions in a traditional MySQL or PostgreSQL database known as the metastore, the same database where Hive keeps this type of data. Thus, Impala can access tables defined or loaded by Hive, as long as all columns use Impala-supported data types, file formats, and compression codecs.
The initial focus on query features and performance means that Impala can read more types of data with the SELECT statement than it can write with the INSERT statement. To query data in the Avro, RCFile, or SequenceFile file formats, you load the data using Hive.
The Impala query optimizer can also make use of table statistics and column statistics. Originally, you gathered this information with the ANALYZE TABLE statement in Hive; in Impala 1.2.2 and later, use the Impala COMPUTE STATS statement instead. COMPUTE STATS requires less setup, is more reliable, and does not require switching back and forth between impala-shell and the Hive shell.
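
As a small sketch (the table name is illustrative), gathering and inspecting statistics from within impala-shell looks roughly like this:

    -- Gather table and column statistics for the query optimizer (Impala 1.2.2 and later).
    COMPUTE STATS sales;

    -- Inspect what was collected.
    SHOW TABLE STATS sales;
    SHOW COLUMN STATS sales;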

Impala Metadata and the Metastore

As discussed in "How Impala Works with Hive" above, Impala maintains information about table definitions in a central database known as the metastore. Impala also tracks other metadata for the low-level characteristics of data files:

  • The physical locations of blocks of files within HDFS.

For tables with a large volume of data and/or many partitions, retrieving all the metadata for a table can be time-consuming, taking minutes in some cases. Thus, each Impala node caches all of this metadata to reuse for future queries against the same table.
If the table definition or the data in the table is updated, all other Impala daemons in the cluster must receive the latest metadata, replacing the obsolete cached metadata, before issuing a query against that table. In Impala 1.2 and later, the metadata update is automatic, coordinated through the catalogd daemon, for all DDL and DML statements issued through Impala. For details, see "Impala Catalog Service" above.
For DDL and DML issued through Hive, or for changes made manually to files in HDFS, you still use the REFRESH statement (when new data files are added to existing tables) or the INVALIDATE METADATA statement (for entirely new tables, or after dropping a table, performing an HDFS rebalance operation, or deleting data files). Issuing INVALIDATE METADATA by itself retrieves metadata for all the tables tracked by the metastore. If you know that only specific tables were changed outside of Impala, you can issue REFRESH table_name for each affected table to retrieve the latest metadata for only those tables.
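
A minimal sketch of these two statements (the table names are illustrative assumptions):

    -- After Hive or another process adds data files to an existing table's directory:
    REFRESH sales;

    -- After a table is created or dropped outside of Impala, or after larger
    -- HDFS-level changes; without a table name, this reloads metadata for all tables.
    INVALIDATE METADATA new_hive_table;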

How Impala Uses HDFS

Impala uses the distributed file system HDFS as its primary data storage medium. Impala relies on the redundancy provided by HDFS to guard against hardware or network outages on individual nodes. Impala table data is physically represented as data files in HDFS, using familiar HDFS file formats and compression codecs. When data files are present in the directory for a new table, Impala reads them all, regardless of file name. New data is added in files with names controlled by Impala.
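
A short sketch of how a table maps to an HDFS directory (the table name and columns are illustrative):

    -- Create a Parquet table; its data files live in a directory under the
    -- Impala/Hive warehouse in HDFS.
    CREATE TABLE page_views (ts TIMESTAMP, url STRING) STORED AS PARQUET;

    -- DESCRIBE FORMATTED shows, among other details, the table's HDFS Location;
    -- any data files placed in that directory belong to the table.
    DESCRIBE FORMATTED page_views;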

How Impala Uses HBase

HBase is an alternative to HDFS as a storage medium for Impala data. It is a database storage system built on top of HDFS, without built-in SQL support. Many Hadoop users already have it configured and store large, often sparse, data sets in it. By defining tables in Impala and mapping them to equivalent tables in HBase, you can query the contents of HBase tables through Impala, and even perform join queries that include both Impala and HBase tables. For details, see "Using Impala to Query HBase Tables".
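
As a hedged sketch of such a mapping (the table names, column family, and columns are illustrative assumptions, and details vary by release), the Impala-visible table is typically defined through Hive using the HBase storage handler, then queried from Impala:

    -- Issued through Hive: map an existing HBase table into the metastore.
    CREATE EXTERNAL TABLE hbase_customers (
      id   STRING,
      name STRING,
      city STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:city')
    TBLPROPERTIES ('hbase.table.name' = 'customers');

    -- Issued through Impala: pick up the new table definition, then join it
    -- with an HDFS-backed Impala table.
    INVALIDATE METADATA hbase_customers;
    SELECT c.name, sum(o.amount)
    FROM hbase_customers c JOIN orders o ON c.id = o.customer_id
    GROUP BY c.name;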


Source: yq.aliyun.com/articles/704483