[Repost] The relationship between Impala and Hive

Reposted from https://www.cnblogs.com/zlslch/p/6785207.html?utm_source=itdadao&utm_medium=referral

The relationship between Impala and Hive

  Impala is a real-time big data analytics query engine built on top of Hive. It uses the Hive metastore directly, which means Impala's table metadata is stored in the Hive metastore. Impala is also compatible with Hive's SQL parsing and implements a subset of Hive SQL semantics, a subset that is continually being extended.
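
  As a minimal sketch of this metadata sharing (the table name, columns, and file layout below are hypothetical, not from the original article), a table created through Hive becomes visible to Impala once Impala refreshes its view of the shared metastore:

      -- In the Hive shell: create a table; its definition lands in the Hive metastore.
      CREATE TABLE web_logs (
        ip  STRING,
        ts  STRING,
        url STRING
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;

      -- In impala-shell: pick up the new metastore entry, then query the table
      -- with the same SQL syntax.
      INVALIDATE METADATA web_logs;
      SELECT url, COUNT(*) AS hits
      FROM web_logs
      GROUP BY url
      ORDER BY hits DESC
      LIMIT 10;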

Relationship with Hive

  Impala and Hive are both data query tools built on top of Hadoop, but they have different emphases. From the client's point of view, however, they have a lot in common: table metadata, ODBC/JDBC drivers, SQL syntax, flexible file formats, and shared storage resource pools. The diagram below shows where Hive and Impala sit in the Hadoop ecosystem. Hive is suitable for long-running batch analysis, while Impala is suitable for real-time interactive SQL queries; Impala gives data analysts a tool for quickly experimenting with and validating ideas on big data. A common pattern is to use Hive first for data transformation, and then use Impala for fast analysis of the result data sets that Hive produces.

            [Figure: Hive and Impala in the Hadoop ecosystem]

Optimization techniques Impala uses compared with Hive

  • 1. MapReduce is not used for parallel computation. Although MapReduce is an excellent parallel computing framework, it is batch-oriented rather than suited to interactive SQL execution. Compared with MapReduce, Impala turns the whole query into an execution-plan tree instead of a series of MapReduce tasks; after the plan is distributed, Impala pulls result data up the execution tree and streams it between nodes, which avoids the overhead of writing intermediate results to disk and then reading them back. Impala runs as long-lived services, so it avoids the per-query startup overhead; compared with Hive, there is no MapReduce startup time.
  • 2. Runtime code generation with LLVM produces code specialized for the particular query, and inlining reduces function-call overhead, speeding up execution.
  • 3. It makes full use of the available hardware instructions (SSE4.2).
  • 4. Better I/O scheduling: Impala knows which disk each data block resides on and can make better use of multiple disks; it also supports reading data blocks directly and computing checksums with native code.
  • 5. The best performance is obtained by choosing an appropriate storage format for the data (Impala supports multiple storage formats); see the sketch after this list.
  • 6. Maximum use of memory: intermediate results are not written to disk but are streamed between stages over the network as soon as they are ready.
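
  As a hedged illustration of point 5 (the table and column names are hypothetical and reuse the web_logs table from the earlier sketch), Impala can copy data into a Parquet table, a columnar format it scans efficiently:

      -- In impala-shell: keep the analysis copy of the data in Parquet.
      CREATE TABLE web_logs_parquet
      STORED AS PARQUET
      AS SELECT ip, ts, url FROM web_logs;

      -- Queries against the Parquet copy read only the columns they need.
      SELECT ip, COUNT(*) AS requests
      FROM web_logs_parquet
      GROUP BY ip;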

Similarities and differences between Impala and Hive

  • Data storage: both use the same storage data pools and support storing data in HDFS and HBase.
  • Metadata: both use the same metadata.
  • SQL parsing: fairly similar; both generate execution plans through lexical analysis.

 

  Execution plan:

  • Hive: depends on the MapReduce execution framework; the execution plan is split into a map -> shuffle -> reduce -> map -> shuffle -> reduce ... pipeline. If a query is compiled into multiple rounds of MapReduce, more intermediate results get written out, and because of how the MapReduce framework executes, these extra intermediate stages increase the overall execution time of the query.
  • Impala: represents the execution plan as a complete plan tree, which can be distributed quite naturally to the individual Impalad daemons for execution, instead of being forced, as in Hive, into the pipelined map -> reduce model. This gives Impala better concurrency and avoids unnecessary intermediate sorts and shuffles. One simple way to see the difference is to ask each engine for its plan, as sketched below.
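
  A minimal sketch of this comparison, assuming a hypothetical orders table that exists in both systems, using the EXPLAIN statement both engines provide:

      -- In the Hive shell: the plan is reported as a sequence of map/reduce stages.
      EXPLAIN
      SELECT customer_id, SUM(amount) AS total
      FROM orders
      GROUP BY customer_id;

      -- In impala-shell: the same statement is reported as a distributed plan tree
      -- (scan, aggregation, and exchange nodes) rather than as MapReduce jobs.
      EXPLAIN
      SELECT customer_id, SUM(amount) AS total
      FROM orders
      GROUP BY customer_id;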

 

  Data flow:

  • Hive: uses a push model; each compute node pushes its data to the downstream node once its computation finishes.
  • Impala: uses a pull model; a downstream node actively calls getNext on the upstream node to fetch data. This way results can be streamed back to the client, and as soon as some data has been processed it can be shown immediately, without waiting for all processing to finish, which suits interactive SQL queries much better.

 

  Memory usage:

  • Hive: if the data does not fit in memory during execution, external (disk) storage is used so that the query can still run to completion. At the end of each MapReduce round the intermediate results are also written to HDFS, and because of how MapReduce executes, the shuffle phase involves writes to local disk as well.
  • Impala: when the data does not fit in memory, the current version (1.0.1) simply returns an error instead of spilling to external storage; future versions should improve this. For now this places certain restrictions on the queries Impala can handle, so it is best used together with Hive. Impala transfers data between stages over the network and performs no disk writes during execution (inserts excepted).

 

  Scheduling:

  • Hive: task scheduling depends on Hadoop's scheduling policy.
  • Impala: does its own scheduling. There is currently only one scheduler, simple-schedule, which tries to satisfy data locality, i.e. to have the process that scans the data run as close as possible to the physical machine holding that data. The scheduler is still fairly simple; as can be seen in SimpleScheduler::GetBackend, it does not yet take load, network, or I/O conditions into account. Impala does gather statistics about query execution, and presumably these statistics will be used for scheduling later on.

 

  Fault tolerance:

  • Hive: relies on Hadoop's fault tolerance.
  • Impala: has no fault-tolerance logic during query execution; if a failure occurs mid-query, an error is returned directly. This is related to Impala's design: since Impala targets real-time queries, a failed query can simply be run again, and re-running it is cheap. Overall, though, Impala copes with failures reasonably well: all Impalad daemons are equivalent, so users can submit a query to any of them. If one Impalad fails, all queries running on it fail, but the user can resubmit them to another Impalad, and the service as a whole is not affected. There is currently only one State Store; if it fails the service keeps running, because every Impalad caches the State Store's information, but the cluster state can no longer be updated, so tasks may be assigned to Impalad instances that have already failed, causing those queries to fail.

 

  Applicable scenarios:

  • Hive: complex batch query jobs and data transformation jobs.
  • Impala: real-time data analysis. Because it does not support UDFs, the problem domain it can handle has certain limitations; it works best in combination with Hive, analyzing in real time the result data sets that Hive has produced (a minimal sketch of this pattern follows).
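
  A hedged sketch of this combined pattern, reusing the hypothetical web_logs table from the earlier sketch: Hive performs the heavy batch transformation, and Impala then serves interactive queries over the result.

      -- In the Hive shell: a long batch job materializes a summary table.
      CREATE TABLE daily_url_stats AS
      SELECT substr(ts, 1, 10) AS log_date, url, COUNT(*) AS hits
      FROM web_logs
      GROUP BY substr(ts, 1, 10), url;

      -- In impala-shell: make the new table visible, then explore it interactively.
      INVALIDATE METADATA daily_url_stats;
      SELECT url, hits
      FROM daily_url_stats
      WHERE log_date = '2017-05-01'
      ORDER BY hits DESC
      LIMIT 20;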

  Impala and Hive are both data query tools built on top of Hadoop, but they have different emphases. So why do we use both tools? Could we not just use Hive or Impala alone?

 

 

I. Introduction to Impala and Hive

  (1) Both Impala and Hive are tools that provide SQL queries over data in HDFS/HBase. Hive translates queries into MapReduce jobs and relies on YARN scheduling to access the data in HDFS, while Impala queries the HDFS data directly. Both, however, accept standard SQL statements.

            

 

  (2) Apache Hive is a high-level abstraction over MapReduce. Using HiveQL, Hive can generate MapReduce or Spark jobs that run on a Hadoop cluster. Hive was originally developed by Facebook around 2007 and is now an Apache open-source project.

  Apache Impala is a high-performance, purpose-built SQL engine that uses Impala SQL. Because Impala queries the data blocks directly, without relying on any other framework, query latency is on the order of milliseconds. Impala was inspired by Google's Dremel project, was developed by Cloudera in 2012, and is now an Apache open-source project.

II. How are Impala and Hive different?

  (1) Hive has many features:

    1. Broader support for complex data types (such as arrays and maps) and for window analytics (see the sketch after this list)

    2. High scalability

    3. Typically used for batch processing
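
  As a hedged sketch of the first point (the table and column names are hypothetical), complex types and window functions in HiveQL look like this:

      -- A Hive table with a complex (array) column.
      CREATE TABLE page_visits (
        user_id  STRING,
        visit_ts STRING,
        tags     ARRAY<STRING>
      );

      -- Explode the array into one row per element with a lateral view.
      SELECT user_id, tag
      FROM page_visits
      LATERAL VIEW explode(tags) t AS tag;

      -- A window function: number each user's visits in time order.
      SELECT user_id, visit_ts,
             ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY visit_ts) AS visit_rank
      FROM page_visits;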

  (2) Impala is faster:

    1. A purpose-built SQL engine, delivering 5x to 50x better performance

    2. An ideal tool for interactive queries and data analysis

    3. More features are continually being added

III. High-level overview:

        [Figure: high-level overview of Hive and Impala]

IV. Why use Hive and Impala?

  1. They bring massive-data analysis capability to data analysts, who need no software development experience and can apply the SQL knowledge they already have to analyze the data.

  2. They are far more productive than writing MapReduce or Spark code directly: 5 lines of HiveQL/Impala SQL can be equivalent to 200 or more lines of Java code (see the sketch after this list).

  3. They interoperate well with other systems, for example through Java and external script extensions, and many business intelligence tools support Hive and Impala.
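
  For instance, an aggregation that would otherwise require a hand-written MapReduce or Spark job fits in a few lines of HiveQL/Impala SQL (the employees table and its columns are hypothetical):

      -- Average salary per department, highest first.
      SELECT department, AVG(salary) AS avg_salary
      FROM employees
      GROUP BY department
      ORDER BY avg_salary DESC;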

V. Hive and Impala use cases

  (1) Log file analysis

  Logs are a common data type and an important data source in the current big data era, and their structure is not fixed. Logs are collected into HDFS by Flume and Kafka; the structure of the log is then analyzed and a log table is created according to the delimiter, after which Hive and Impala are used to analyze the data. For example:

      [Figure: log file analysis example]
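
  A minimal sketch of such a log table, assuming tab-delimited access logs have already been collected into a hypothetical HDFS directory:

      -- Define a table over the raw log files according to their delimiter
      -- (the statement works in both Hive and Impala; path and fields are hypothetical).
      CREATE EXTERNAL TABLE access_logs (
        ip         STRING,
        request_ts STRING,
        url        STRING,
        status     INT
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/logs/access';

      -- Interactive analysis, for example the most frequently requested pages.
      SELECT url, COUNT(*) AS hits
      FROM access_logs
      GROUP BY url
      ORDER BY hits DESC
      LIMIT 10;
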
  (2) Sentiment Analysis

  Many organizations use Hive or Impala to analyze social media coverage. For example:

          [Figure: sentiment analysis example]

  (3) Business Intelligence

  Many leading BI tools support Hive and Impala. For example:

      [Figure: BI tools that support Hive and Impala]


Author: big data and artificial intelligence had been lying in pit
Source: http://www.cnblogs.com/zlslch/
