[Original] Big Data Basics: Impala (1) Introduction, Installation, and Usage

Impala 2.12

Official site: http://impala.apache.org/

I Introduction

Apache Impala is the open source, native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon.

Impala is an open-source analytic database for Hadoop.

  • Do BI-style Queries on Hadoop
    • Impala provides low latency and high concurrency for BI/analytic queries on Hadoop (not delivered by batch frameworks such as Apache Hive). Impala also scales linearly, even in multitenant environments.

Impala provides low-latency, high-concurrency queries on Hadoop.

  • Unify Your Infrastructure
    • Utilize the same file and data formats and metadata, security, and resource management frameworks as your Hadoop deployment—no redundant infrastructure or data conversion/duplication.

It uses the same files, file formats, and metadata as the rest of the Hadoop deployment.

  • Implement Quickly
    • For Apache Hive users, Impala utilizes the same metadata and ODBC driver. Like Hive, Impala supports SQL, so you don't have to worry about re-inventing the implementation wheel.

For Hive users, Impala uses the same metadata and ODBC driver, and it also supports SQL.

Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS, HBase, or the Amazon Simple Storage Service (S3). In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Impala query UI in Hue) as Apache Hive. This provides a familiar and unified platform for real-time or batch-oriented queries.

Impala runs fast, interactive SQL queries directly on Hadoop data (HDFS, HBase, etc.). It uses the same storage platform, metadata, SQL syntax, ODBC driver, and UI as Hive, giving a unified platform for both real-time and batch queries.

Impala is an addition to tools available for querying big data. Impala does not replace the batch processing frameworks built on MapReduce such as Hive. Hive and other frameworks built on MapReduce are best suited for long running batch jobs, such as those involving batch processing of Extract, Transform, and Load (ETL) type jobs.

Impala is a strong addition to the big-data query toolset; it does not replace batch-processing frameworks such as Hive, which are typically used for long-running ETL jobs.

Impala provides:

  • Familiar SQL interface that data scientists and analysts already know.
  • Ability to query high volumes of data ("big data") in Apache Hadoop.
  • Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective commodity hardware.
  • Ability to share data files between different components with no copy or export/import step; for example, to write with Pig, transform with Hive and query with Impala. Impala can read from and write to Hive tables, enabling simple data interchange using Impala for analytics on Hive-produced data (see the sketch after this list).
  • Single system for big data processing and analytics, so customers can avoid costly modeling and ETL just for analytics.
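As a concrete illustration of the data-sharing bullet above, here is a minimal sketch in which a table is created and loaded through Hive and then queried through Impala. The database, table, and $impala_server placeholder are assumptions, not part of the original post; note that a table created outside Impala needs an INVALIDATE METADATA before Impala can see it.

# Create and load a table through Hive (demo_db/events are placeholder names)
hive -e "CREATE DATABASE IF NOT EXISTS demo_db"
hive -e "CREATE TABLE demo_db.events (id INT, name STRING)"
hive -e "INSERT INTO demo_db.events VALUES (1, 'a'), (2, 'b')"

# Let Impala pick up the new table, then query it
impala-shell -i $impala_server:21000 -q "INVALIDATE METADATA demo_db.events"
impala-shell -i $impala_server:21000 -q "SELECT count(*) FROM demo_db.events"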

Impala architecture

The Impala server is a distributed, massively parallel processing (MPP) database engine.

1 Impala Daemon

The core Impala component is a daemon process that runs on each DataNode of the cluster, physically represented by the impalad process. It reads and writes to data files; accepts queries transmitted from the impala-shell command, Hue, JDBC, or ODBC; parallelizes the queries and distributes work across the cluster; and transmits intermediate query results back to the central coordinator node.

The Impala daemons are deployed alongside the DataNodes; each one reads and writes data files, handles requests from impala-shell/Hue/JDBC/ODBC, parallelizes and distributes queries across the cluster, and returns results. Multiple instances are deployed.
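A small sketch of this behaviour, assuming placeholder host names: whichever impalad the shell connects to acts as the coordinator for that query, so the same statement can be sent to any daemon in the cluster (21000 is the default impala-shell port).

# Connecting to a different impalad simply picks a different coordinator;
# both commands run the same query against the same data
impala-shell -i impalad-host-1:21000 -q "SELECT count(*) FROM demo_db.events"
impala-shell -i impalad-host-2:21000 -q "SELECT count(*) FROM demo_db.events"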

2 Impala Statestore

The Impala component known as the statestore checks on the health of Impala daemons on all the DataNodes in a cluster, and continuously relays its findings to each of those daemons. It is physically represented by a daemon process named statestored; you only need such a process on one host in the cluster. If an Impala daemon goes offline due to hardware failure, network error, software issue, or other reason, the statestore informs all the other Impala daemons so that future queries can avoid making requests to the unreachable node.

The Impala statestore monitors and records the health of the Impala daemons, so that queries can skip unhealthy nodes; only one instance is needed.

3 Impala Catalog Service

The Impala component known as the catalog service relays the metadata changes from Impala SQL statements to all the Impala daemons in a cluster. It is physically represented by a daemon process named catalogd; you only need such a process on one host in the cluster. Because the requests are passed through the statestore daemon, it makes sense to run the statestored and catalogd services on the same host.

The Impala catalog service is responsible for relaying metadata changes to all Impala daemons; only one instance is needed.
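A hedged sketch of what catalogd does and does not cover (table names and $impala_server are placeholders): DDL issued through Impala is broadcast to the other daemons automatically, while data written outside Impala, for example by Hive, still needs a REFRESH (or an INVALIDATE METADATA for brand-new tables).

# DDL run through Impala is propagated by catalogd; no manual refresh needed
impala-shell -i $impala_server:21000 -q "CREATE TABLE demo_db.logs (msg STRING)"

# Data added outside Impala (here via Hive) needs a REFRESH before Impala sees the new files
hive -e "INSERT INTO demo_db.logs VALUES ('hello')"
impala-shell -i $impala_server:21000 -q "REFRESH demo_db.logs"
impala-shell -i $impala_server:21000 -q "SELECT * FROM demo_db.logs"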

Clients

  • The impala-shell interactive command interpreter.
  • The Hue web-based user interface.
  • JDBC.
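For the JDBC path, a hedged sketch using beeline (shipped with Hive) against Impala's HiveServer2-compatible endpoint; it assumes the default port 21050 and a cluster without Kerberos/LDAP, hence auth=noSasl.

# Connect over JDBC via beeline (21050 is the default HiveServer2-compatible port;
# auth=noSasl assumes an unsecured cluster)
beeline -u "jdbc:hive2://$impala_server:21050/default;auth=noSasl" -e "SELECT 1"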

II Installation

Installation can be done in any of three ways:

1 Cloudera Manager installation

Done through the Cloudera Manager web UI.

2 Ambari installation

See https://www.cnblogs.com/barneywill/p/10290849.html

3 Manual installation

1 Add the yum repo

# cat /etc/yum.repos.d/cdh.repo

[cloudera-cdh5]
# Packages for Cloudera's Distribution for Hadoop, Version 5, on RedHat or CentOS 7 x86_64
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=https://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/5/
gpgkey=https://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1

2 Install the packages

# yum install impala impala-catalog impala-server impala-state-store impala-shell 

The roles can also be installed separately:

Catalog installation
# yum install impala impala-catalog

Server installation
# yum install impala impala-server

Statestore installation
# yum install impala impala-state-store

Client installation
# yum install impala-shell

Configuration files

/etc/default/impala

Place the Hadoop, Hive, and HBase configuration files (core-site.xml, hdfs-site.xml, hive-site.xml, hbase-site.xml) into

/etc/impala/conf/
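A minimal sketch of that step, assuming the usual CDH package locations for the client configs (adjust the source paths to your deployment):

# Copy the client configs Impala needs into its conf directory
cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml /etc/impala/conf/
cp /etc/hive/conf/hive-site.xml /etc/impala/conf/
cp /etc/hbase/conf/hbase-site.xml /etc/impala/conf/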

Startup commands

service impala-statestore start
service impala-catalog start
service impala-server start
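A quick check that the daemons actually came up (the process names match the architecture section above):

# All three processes should be running after the services start
ps -ef | grep -E 'impalad|statestored|catalogd' | grep -v grep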

Note: Impala depends on the Hive metastore; Impala 2.12 supports Hive 2 and earlier, and does not support Hive 3.

Impala server web UI
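Each daemon also exposes a debug web UI; the ports below are the usual defaults (they can be changed in /etc/default/impala), with $impala_server as a placeholder host.

# Debug web UIs on their default ports
curl -s http://$impala_server:25000/   # impalad
curl -s http://$impala_server:25010/   # statestored
curl -s http://$impala_server:25020/   # catalogd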

III Usage

Using impala-shell

$ impala-shell -i $impala_server:21000
Starting Impala Shell without Kerberos authentication
Connected to $impala_server:21000
Server version: impalad version 2.12.0-cdh5.16.1 RELEASE (build 4a3775ef6781301af81b23bca45a9faeca5e761d)
***********************************************************************************
Welcome to the Impala shell.
(Impala Shell v2.12.0-cdh5.16.1 (4a3775e) built on Wed Nov 21 21:02:28 PST 2018)

When you set a query option it lasts for the duration of the Impala shell session.
***********************************************************************************
[$impala_server:21000] >

Once connected, it can be used much like Hive.
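A few everyday statements as a sketch of a session (demo_db and events are placeholder names):

[$impala_server:21000] > SHOW DATABASES;
[$impala_server:21000] > USE demo_db;
[$impala_server:21000] > SHOW TABLES;
[$impala_server:21000] > DESCRIBE events;
[$impala_server:21000] > SELECT count(*) FROM events;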

Reposted from www.cnblogs.com/barneywill/p/10296502.html