Impala: a new generation of open source big data analysis engine

Big data processing is an important topic in cloud computing. Since Google proposed the MapReduce distributed processing framework, open source software represented by Hadoop has won the attention and favor of more and more companies. Built on Hadoop, systems such as HBase, Hive, and Pig have sprung up and joined the Hadoop ecosystem. Today we will talk about a new member of the Hadoop family: Impala.

 

Impala Architecture Analysis

Impala is a new query system led by Cloudera. It provides SQL semantics and can query petabyte-scale data stored in Hadoop's HDFS and HBase. The existing Hive system also provides SQL semantics, but because Hive executes queries on the MapReduce engine, it remains a batch process and struggles to deliver interactive queries. In contrast, Impala's defining feature and biggest selling point is that it is fast. So how does Impala achieve fast queries over big data? Before answering this question, we need to introduce Google's Dremel system [1], because Impala was originally designed with reference to Dremel.

Dremel is Google's interactive data analysis system. It is built on Google's GFS (Google File System) and other systems, and it supports Google's data analysis service BigQuery and many other services. Dremel has two main technical highlights: it implements column storage for nested data, and it uses a multi-layer query tree so that tasks can be executed in parallel on thousands of nodes and their results aggregated. Column storage is familiar from relational databases: it reduces the amount of data processed during a query and so improves query efficiency. Dremel's column store differs in that it targets data with nested structure rather than traditional relational data. Dremel converts nested records into column storage form. At query time it reads the required columns according to the query conditions, applies the filters, and then assembles the columns back into nested records for output. Both the forward and the reverse conversion are implemented with efficient state machines. Dremel's multi-layer query tree, in turn, draws on the design of distributed search engines. The root node of the query tree receives queries and distributes them to the nodes in the next layer. The bottom-layer nodes perform the actual data reading and query execution, and then return their results to the layer above. For more information on how Dremel is implemented, see [9].
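The multi-layer query tree described above can be illustrated with a minimal Python sketch. This is not Dremel's code: the class names Leaf and Inner, the sample rows, and the choice of a grouped SUM as the query are all assumptions made for illustration, and the children are visited sequentially here, whereas a real system would dispatch to them in parallel.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Leaf:
    """Bottom-layer node (hypothetical): scans its local rows and returns a partial aggregate."""
    rows: List[dict]

    def execute(self, group_col: str, sum_col: str) -> Dict[str, int]:
        partial: Dict[str, int] = {}
        for row in self.rows:
            key = row[group_col]
            partial[key] = partial.get(key, 0) + row[sum_col]
        return partial


@dataclass
class Inner:
    """Root or intermediate node (hypothetical): forwards the query down and merges the results."""
    children: list = field(default_factory=list)

    def execute(self, group_col: str, sum_col: str) -> Dict[str, int]:
        merged: Dict[str, int] = {}
        # Sequential for simplicity; a real query tree would query children in parallel.
        for child in self.children:
            for key, value in child.execute(group_col, sum_col).items():
                merged[key] = merged.get(key, 0) + value
        return merged


def build_demo_tree() -> Inner:
    """Builds a three-layer tree: one root, two intermediate nodes, three leaves."""
    leaves = [
        Leaf([{"country": "US", "clicks": 3}, {"country": "CN", "clicks": 5}]),
        Leaf([{"country": "US", "clicks": 2}]),
        Leaf([{"country": "CN", "clicks": 1}, {"country": "DE", "clicks": 4}]),
    ]
    return Inner([Inner(leaves[:2]), Inner(leaves[2:])])


if __name__ == "__main__":
    # Roughly: SELECT country, SUM(clicks) ... GROUP BY country
    print(build_demo_tree().execute("country", "clicks"))
```

Because each layer returns only partial aggregates, the data sent upward shrinks at every level, which is what lets the root combine results from a very large number of leaves.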

Impala is in effect Dremel for Hadoop, and the column storage format it uses is Parquet. Parquet implements Dremel's column storage; in the future it will also support Hive and add features such as dictionary encoding and run-length encoding. The system architecture of Impala is shown in Figure 1. Impala uses Hive's SQL interface (including operations such as SELECT, INSERT, and JOIN) but currently implements only a subset of Hive's SQL semantics (for example, UDFs are not yet supported). Table metadata is stored in Hive's Metastore. StateStore is a sub-service of Impala that monitors the health of each node in the cluster and provides functions such as node registration and error detection. Impala runs a background service, impalad, on each node to respond to external requests and carry out the actual query processing. impalad consists mainly of three modules: Query Planner, Query Coordinator, and Query Exec Engine. The Query Planner receives queries from SQL applications and ODBC and converts each query into many sub-queries. The Query Coordinator distributes these sub-queries to the nodes. The Query Exec Engine on each node executes its sub-queries and returns their results; these intermediate results are then aggregated and returned to the user.
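The dictionary encoding and run-length encoding mentioned above can be illustrated on a single column of values. The Python sketch below shows only the general idea; it is not Parquet's on-disk format, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each distinct value with a small integer code (first-seen order)."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (code, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1] = (code, runs[-1][1] + 1)
        else:
            runs.append((code, 1))
    return runs


def decode(dictionary: List[str], runs: List[Tuple[int, int]]) -> List[str]:
    """Invert both encodings to recover the original column."""
    out: List[str] = []
    for code, count in runs:
        out.extend([dictionary[code]] * count)
    return out


if __name__ == "__main__":
    column = ["US", "US", "US", "CN", "CN", "US"]
    dictionary, codes = dictionary_encode(column)
    runs = run_length_encode(codes)
    print(dictionary, codes, runs)
    assert decode(dictionary, runs) == column
```

Both encodings pay off in column stores because values within a single column tend to repeat far more often than values across a row.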

Figure 1. System architecture diagram of Impala [2]

In Cloudera's tests, Impala's query efficiency was orders of magnitude higher than Hive's. From a technical point of view, Impala performs well mainly for the following reasons:

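1) Impala does not need to write intermediate results to disk, which saves a great deal of I/O overhead.

2) It avoids the overhead of starting MapReduce jobs. MapReduce is slow to start tasks (the default heartbeat interval is 3 seconds), whereas Impala schedules work directly through its own service processes, which is much faster.

3) Impala completely abandons MapReduce, a paradigm poorly suited to SQL queries. Like Dremel, it borrows ideas from MPP parallel databases and starts from scratch, which allows more query optimization and eliminates unnecessary overhead such as shuffle and sort.

4) It uses LLVM to compile runtime code in a unified way, avoiding the unnecessary overhead that comes with supporting general-purpose compilation.

5) It is implemented in C++ with many targeted hardware optimizations, such as the use of SSE instructions.

6) It uses an I/O scheduling mechanism that supports data locality, placing data and computation on the same machine wherever possible to reduce network overhead.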
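The data-locality scheduling in the last item can be sketched in a few lines of Python. This is a simplified illustration, not Impala's scheduler: the function name assign_blocks, the block and node identifiers, and the rule of picking the least-loaded node are all assumptions.

```python
from typing import Dict, List


def assign_blocks(block_replicas: Dict[str, List[str]], nodes: List[str]) -> Dict[str, str]:
    """Assign each data block to a node, preferring nodes that hold a local replica.

    block_replicas maps a block id to the nodes storing a copy of it.
    Among the candidates, the least-loaded node is chosen; if no query node
    holds a replica, the block falls back to a remote read on any node.
    """
    load = {node: 0 for node in nodes}
    assignment: Dict[str, str] = {}
    for block, replicas in block_replicas.items():
        local = [node for node in replicas if node in load]
        candidates = local if local else nodes  # remote read only as a fallback
        chosen = min(candidates, key=lambda node: load[node])
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


if __name__ == "__main__":
    replicas = {
        "b1": ["n1", "n2"],
        "b2": ["n1", "n3"],
        "b3": ["n1", "n2"],
        "b4": ["n9"],  # no replica on any query node
    }
    print(assign_blocks(replicas, ["n1", "n2", "n3"]))
```

Each block with a local replica is read on a machine that already stores it, so only the fallback case sends data across the network.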

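Although Impala is modeled on Dremel, it has features of its own. For example, Impala supports not only the Parquet format but can also directly process text, SequenceFile, and other file formats commonly used in Hadoop. More importantly, Impala is open source, and given Cloudera's leading position in the Hadoop field, its ecosystem is likely to grow quickly. In the near future, Impala may well make its mark in big data processing just as Hadoop and Hive did before it. Cloudera itself has said that it expects Impala to replace Hive entirely one day. Of course, migrating users from Hive to Impala takes time, and Impala has only just reached version 1.0. Although it is claimed to run stably in production environments, there is surely still plenty of room for improvement [7]. It should be noted that Impala is not meant to replace the existing MapReduce system but to serve as a powerful complement to it. In general, Impala suits queries whose output is small or moderate in size, while MapReduce remains the better choice for batch jobs over large volumes of data. As an aside, Marcel Kornacker, the architect in charge of Impala at Cloudera, previously led development of the query engine for the F1 system at Google, which shows that Google really has contributed both money and effort to the rise of big data.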

Comparison of Impala with Shark, Drill, and Others

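The Apache open source organization has also launched a project named Drill to implement Dremel on Hadoop. The project is still under development, and little documentation or code is available so far, so for the moment it does not pose a serious threat to Impala [10]. According to answers on Quora, Cloudera has 7 to 8 engineers working full time on Impala, whereas Drill's progress has been comparatively slow. Specifically, as of the end of October 2012, Drill's code base contained a query parser, a plan parser, and a plan evaluator that could scan JSON data. By the same date, Impala already had a fairly complete distributed query execution engine, with support for reading data from HDFS and HBase, error detection, data modification through INSERT, and LLVM dynamic translation. On the other hand, as an Apache project Drill has avoided dominance by a single vendor from the start, and it will support all popular Hadoop distributions, unlike Impala, which supports only Cloudera's own distribution, CDH. In the long run, it is far from certain which will come out ahead [10].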

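In addition, AMPLab at the University of California, Berkeley has developed a big data analysis system named Shark. This year's June issue of Programmer magazine carried an article analyzing the Spark system, which is related to Shark; interested readers may refer to it. In the long term, Shark aims to become an integrated data processing system that supports both SQL queries over big data and advanced data analysis tasks. In terms of implementation, Shark achieves good fault tolerance through operator derivation in the Scala language, so failed tasks, whether long or short, can be recovered quickly from the last "snapshot point". By contrast, Impala lacks a sufficiently strong fault-tolerance mechanism: once a task fails, it must start over from the beginning, a design that inevitably costs some performance. Shark is also designed with memory as a first-class storage medium, which gives it some advantage in processing speed [11]. In fact, AMPLab recently ran a comparison of Hive, Impala, Shark, and Redshift, the commercial MPP database used by Amazon, across three types of task: Scan Query, Aggregation Query, and Join Query. Figure 2 shows the Aggregation Query comparison from the AMPLab report. The commercial Redshift performs best, Impala and Shark each win in some cases, and both are far faster than Hive. Readers can find more of the experimental results in [12].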

Figure 2. Aggregation Query performance comparison of Redshift, Impala, Shark, and Hive [12]

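In my humble opinion, technology is often not the most critical factor for big data analysis projects. For example, both MapReduce and HDFS in Hadoop derive from Google's designs and contain little that is original. In practice, the ecosystem, community, and pace of development of an open source project largely determine how systems such as Impala and Shark fare. Cloudera decided from the outset to open-source Impala, hoping to use the strength of the open source community to promote the product. Shark has likewise been open source from the beginning, and the same goes without saying for Apache's Drill. In the end it comes down to whose ecosystem is stronger. A temporary technical lead is not enough to guarantee a project's eventual success. It is still hard to say which product will become the de facto standard, but one thing we can be sure of is that big data analysis will keep spreading as new technologies emerge, and that is always good news for users. For example, readers who have followed the development of next-generation Hadoop (YARN) will have noticed that YARN already supports computing paradigms other than MapReduce (such as Shark and Impala). Hadoop may therefore evolve into an inclusive platform that hosts all kinds of data processing technologies, some for queries that return in seconds and some for large-scale batch processing, covering the full range of user needs.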

Future Outlook

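Beyond open source solutions such as Impala, Shark, and Drill, traditional vendors such as Oracle and EMC are not sitting idle while open source software eats into their markets. EMC, for example, has released the HAWQ system and claims that it is more than ten times faster than Impala, and Amazon's Redshift, mentioned above, also outperforms Impala. Although open source software is a powerful force thanks to its considerable cost advantage, traditional database vendors will continue to differentiate themselves with products that are stronger in performance, stability, and maintenance services. At the same time, they will take part in open source communities and build on open source software to broaden their product lines and sharpen their competitiveness, and they will meet the needs of certain customers with more high-value-added services. After all, these vendors have accumulated a great deal of technology and experience in traditional fields such as parallel databases, and that foundation runs deep. There are even NewSQL systems such as NuoDB (a startup) that claim to offer both ACID support and scalability. Overall, big data analysis technology will become more mature, cheaper, and easier to use, and users will in turn find it easier and more convenient to extract valuable business information from their own big data.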

References

[1] http://research.google.com/pubs/pub36632.html

[2] http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

[3] http://www.slideshare.net/cloudera/data-science-on-hadoop

[4] List of key Impala questions: http://yuntai.1kapp.com/?p=1089

[5] Hive: principles and shortcomings: http://www.ccplat.com/?p=1035

[6] Impala/Hive: current status and future prospects: http://yanbohappy.sinaapp.com/?p=220

[7] What's next for Cloudera Impala: http://blog.cloudera.com/blog/2012/12/whats-next-for-cloudera-impala/

[8] MapReduce: a major step backwards: http://t.cn/zQLFnWs

[9] How Google Dremel works: analyzing 1 PB in 3 seconds: http://www.yankay.com/google-dremel-rationale/

[10] Isn't Cloudera Impala doing the same job as Apache Drill incubator project?: http://www.quora.com/Cloudera-Impala/Isnt-Cloudera-Impala-doing-the-same-job-as-Apache-Drill-incubator-project

[11] Shark: https://github.com/amplab/shark/wiki

[12] Big Data Benchmark: https://amplab.cs.berkeley.edu/benchmark/

[13] Impala wiki: http://dirlt.com/impala.html

[14] How does Impala compare to Shark: http://www.quora.com/Apache-Hadoop/How-does-Impala-compare-to-Shark

[15] EMC explains Hawq SQL performance: left hand Hive right hand Impala: http://stor-age.zdnet.com.cn/stor-age/2013/0308/2147607.shtml
