Status quo of open source big data query and analysis engines

Big data query and analysis is one of the core problems in cloud computing. Several papers that Google published before 2006 laid the foundation for cloud computing; GFS, Map-Reduce and Bigtable in particular are known as the three cornerstones of its underlying technology. GFS and Map-Reduce directly led to the birth of the Apache Hadoop project. Bigtable and Amazon Dynamo directly gave rise to the new database field of NoSQL, shaking the decades-long dominance of the RDBMS in commercial databases and data warehouses. Facebook's Hive project is a data warehouse infrastructure built on Hadoop that provides a set of tools for storing, querying and analyzing large-scale data.

While the rest of us were still absorbed in understanding, mastering and imitating GFS, Map-Reduce and Bigtable, Google launched a number of new technologies after 2009, including Dremel, Pregel, Percolator, Spanner and F1. Dremel prompted the rise of real-time computing systems, Pregel opened up a new direction in graph data computing, Percolator made distributed incremental index updates a new standard in text retrieval, and Spanner and F1 showed the possibility of databases that span data centers. In this second wave of Google technology, building on Hive and Dremel, the emerging big data company Cloudera open sourced the big data query and analysis engine Impala, Hortonworks open sourced Stinger, and Facebook open sourced Presto. UC Berkeley's AMPLab developed the Spark computing framework and open sourced the big data query and analysis engine Shark with Spark at its core.

Prompted by the need to select a big data query engine for a telecom operator project, this article briefly introduces and compares five mainstream open source big data query and analysis engines, namely Hive, Impala, Shark, Stinger and Presto, and closes with a summary and outlook.
The evolutionary maps of Hive, Impala, Shark, Stinger, and Presto are shown in Figure 1.

Figure 1. Evolutionary map of Impala, Shark, Stinger, and Presto

Introduction to current mainstream engines

Hadoop, built on the Map-Reduce model, is good at batch data processing but not particularly well suited to real-time queries. Real-time queries generally use an MPP (Massively Parallel Processing) architecture, so users have had to choose between Hadoop and MPP. In Google's second wave of technology, fast SQL access technologies built on the Hadoop architecture have gradually gained attention, and the new trend is to combine MPP and Hadoop to provide a fast SQL access framework. Four popular open source tools have recently emerged: Impala, Shark, Stinger and Presto. Their arrival shows how much the big data community wants real-time queries in the Hadoop ecosystem.

Broadly speaking, Impala, Shark, Stinger and Presto are all SQL-like real-time big data query and analysis engines, but their technical focuses are quite different. Nor were they created to replace Hive, which remains very valuable as a data warehouse. These four systems and Hive are all data query tools built on Hadoop, each with a different emphasis, but from the client's point of view they have a lot in common with Hive: table metadata, the Thrift interface, ODBC/JDBC drivers, SQL syntax, flexible file formats, storage resource pools and so on. The relationship between Hive and Impala, Shark, Stinger and Presto in Hadoop is shown in Figure 2.

Hive is suited to long-running batch query analysis, while Impala, Shark, Stinger and Presto are suited to real-time interactive SQL queries, giving data analysts tools for quickly trying out and verifying ideas. A typical approach is to use Hive first for data transformation and then use one of these four systems for fast analysis of the Hive-processed dataset. Below, we briefly introduce Hive, Impala, Shark, Stinger and Presto in terms of their problem domains:

1) Hive: Map-Reduce in the cloak of SQL. Hive wraps Map-Reduce in a layer of SQL to make it more convenient for users. Because Hive uses SQL, its problem domain is narrower than that of Map-Reduce: many problems cannot be expressed in SQL, such as some data mining, recommendation and image recognition algorithms, and these still have to be written directly in Map-Reduce.

2) Impala: an open source implementation of Google Dremel (similar to Apache Drill). In response to the demand for interactive real-time computing, Cloudera launched Impala, which suits interactive real-time processing scenarios in which the final result set is small.

3) Shark/Spark: to improve on the computational efficiency of Map-Reduce, Berkeley's AMPLab developed Spark, which can be regarded as a memory-based implementation of Map-Reduce. Berkeley also wrapped a layer of SQL around Spark, creating Shark, a new Hive-like system.

4) Stinger Initiative (Tez-optimized Hive): Hortonworks open sourced Tez, a DAG computing framework. Tez is sometimes loosely likened to an open source implementation of Google Pregel, although Pregel is aimed at graph processing while Tez is a general DAG execution framework. The framework can be used to design DAG applications in the same way as Map-Reduce, but note that Tez can only run on YARN. An important use of Tez is to optimize typical DAG workloads such as Hive and Pig: it reduces data read/write I/O and optimizes the DAG flow, making Hive many times faster.

5) Presto: Facebook open sourced Presto in November 2013. It is a distributed SQL query engine designed for high-speed, real-time data analysis. It supports standard ANSI SQL, including complex queries, aggregations, joins and window functions. Presto has a simple storage abstraction layer so that SQL queries can run on top of different data storage systems (including HBase, HDFS, Scribe and others).

Figure 2. The relationship between Hive and Impala, Shark, Stinger, and Presto in Hadoop

Current mainstream engine architecture

Hive

Hive is a data warehouse tool based on Hadoop. It maps structured data files onto database tables and provides full SQL query functionality by converting SQL statements into Map-Reduce tasks for execution, which makes it well suited to statistical analysis in a data warehouse. Its architecture is shown in Figure 3. Hadoop and Map-Reduce are the foundation of the Hive architecture, which includes the following components: CLI (Command Line Interface), JDBC/ODBC, Thrift Server, Metastore and Driver (Compiler, Optimizer and Executor).

Figure 3. Hive architecture
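
To make the idea of converting a SQL statement into Map-Reduce tasks more concrete, here is a minimal Python sketch of how an aggregation such as `SELECT city, COUNT(*) FROM visits GROUP BY city` could be expressed as map, shuffle and reduce phases. The table `visits`, its columns and all function names are hypothetical, chosen only for illustration; this is not the plan Hive's compiler actually produces.

```python
from collections import defaultdict

# Hypothetical rows of a table visits(user, city); the names are illustrative only.
VISITS = [
    ("alice", "Beijing"),
    ("bob", "Shanghai"),
    ("carol", "Beijing"),
    ("dave", "Shenzhen"),
    ("erin", "Beijing"),
]


def map_phase(rows):
    """Emit (city, 1) for every row, as a mapper for GROUP BY city would."""
    for _user, city in rows:
        yield city, 1


def shuffle_phase(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key, as a reducer computing COUNT(*) would."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_city(rows):
    """Run the three phases in sequence on an in-memory list of rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_city(VISITS))
```

In a real cluster each phase runs on many machines and the shuffle moves data over the network; the sketch only shows the division of labor between the phases.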

Impala Architecture

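Impala is a real-time interactive SQL query tool for big data that Cloudera developed, inspired by Google's Dremel. It can be seen as a combination of the Google Dremel architecture and an MPP (Massively Parallel Processing) structure. Instead of the slow Hive and Map-Reduce batch processing, Impala uses a distributed query engine similar to those in commercial parallel relational databases (made up of a Query Planner, a Query Coordinator and a Query Exec Engine). It can query data directly from HDFS or HBase with SELECT, JOIN and aggregate functions, which greatly reduces latency. Its architecture is shown in Figure 4. Impala consists mainly of Impalad, the State Store and the CLI.

Impalad runs on the same node as the DataNode and is represented by the impalad process. It receives query requests from clients, reads and writes data, executes queries in parallel, and streams results over the network back to the Coordinator, which returns them to the client. The Impalad that receives a query acts as the Coordinator: it calls the Java front end through JNI to parse the SQL statement and generate a query plan tree, and the scheduler then distributes the execution plan to the other Impalad instances that hold the relevant data. Impalad also keeps a connection to the State Store to determine which Impalad instances are healthy and can accept new work.

The Impala State Store tracks the health and location of the Impalad instances in the cluster and is represented by the statestored process. It creates multiple threads to handle Impalad registration and subscription and to maintain heartbeat connections with each Impalad. Every Impalad caches a copy of the State Store's information. When the State Store goes offline, Impalad can keep working from its cache, but if some Impalad instances have failed the cached data can no longer be updated, so an execution plan may be assigned to a failed Impalad and the query then fails.

The CLI is the command-line tool for user queries. Impala also provides Hue, JDBC, ODBC and Thrift interfaces.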

Figure 4. Impala architecture
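
The State Store behavior described above (each Impalad caches the membership list and keeps working from that cache when the State Store is offline, at the risk of scheduling work on a failed node) can be illustrated with a toy Python model. The classes `StateStore` and `Coordinator`, the round-robin scheduling and the node names are all assumptions made for this sketch; none of them is Impala's real code or API.

```python
class StateStore:
    """Toy stand-in for the State Store: tracks which daemons are healthy."""

    def __init__(self, daemons):
        self.healthy = set(daemons)
        self.online = True

    def mark_failed(self, daemon):
        self.healthy.discard(daemon)

    def snapshot(self):
        if not self.online:
            raise ConnectionError("state store is offline")
        return set(self.healthy)


class Coordinator:
    """Toy coordinator that schedules fragments from a cached membership list."""

    def __init__(self, state_store):
        self.state_store = state_store
        self.cached_members = state_store.snapshot()

    def refresh(self):
        """Try to update the cache; keep the stale copy if the store is down."""
        try:
            self.cached_members = self.state_store.snapshot()
            return True
        except ConnectionError:
            return False

    def schedule(self, fragments):
        """Assign fragments round-robin over the cached members (an assumption)."""
        members = sorted(self.cached_members)
        return {frag: members[i % len(members)] for i, frag in enumerate(fragments)}


def query_succeeds(coordinator, fragments, alive):
    """A query succeeds only if every fragment lands on a node that is alive."""
    plan = coordinator.schedule(fragments)
    return all(node in alive for node in plan.values())


def demo():
    nodes = ["impalad-1", "impalad-2", "impalad-3"]
    alive = set(nodes)
    store = StateStore(nodes)
    coordinator = Coordinator(store)

    # Case 1: a node fails while the state store is online; the cache is refreshed.
    alive.discard("impalad-3")
    store.mark_failed("impalad-3")
    coordinator.refresh()
    online_ok = query_succeeds(coordinator, ["f0", "f1"], alive)

    # Case 2: the state store goes offline, then another node fails; the cache is stale.
    store.online = False
    alive.discard("impalad-2")
    coordinator.refresh()
    offline_ok = query_succeeds(coordinator, ["f0", "f1"], alive)

    return online_ok, offline_ok


if __name__ == "__main__":
    print(demo())
```

In the second case the coordinator still believes `impalad-2` is healthy, so one fragment is sent to a dead node and the query fails, which is the failure mode the text describes.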

Shark architecture

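Shark is a data warehouse product open sourced by UC Berkeley's AMPLab. It is fully compatible with Hive's HQL syntax, but whereas Hive uses Map-Reduce as its computing framework, Shark uses Spark. Hive is therefore SQL on Map-Reduce, and Shark is Hive on Spark. Its architecture is shown in Figure 5. To stay as compatible with Hive as possible, Shark reuses most of Hive's components, as follows: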

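1) SQL parser and plan generation: Shark is fully compatible with Hive's HQL syntax and uses Hive's API for query parsing and query plan generation; only the final physical plan execution stage uses Spark in place of Hadoop Map-Reduce.

2) Metastore: Shark uses the same metadata as Hive, so tables created in Hive can be accessed seamlessly from Shark.

3) SerDe: Shark's serialization mechanism and data types are identical to Hive's.

4) UDF: Shark can reuse all of Hive's UDFs. By configuring Shark parameters, Shark can automatically cache specific RDDs (Resilient Distributed Datasets) in memory so that data is reused and retrieval of those datasets is faster. Shark can also implement specific data analysis and learning algorithms through UDFs (user-defined functions), so that SQL queries and analytic computation can be combined and RDD reuse is maximized.

5) Driver: Shark wraps Hive's CliDriver to produce SharkCliDriver, which is the entry point of the shark command.

6) ThriftServer: Shark wraps Hive's ThriftServer (which supports JDBC/ODBC) to produce SharkServer, which also provides JDBC/ODBC services.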

Figure 5. Shark architecture

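Spark is a general-purpose parallel computing framework open sourced by UC Berkeley's AMPLab, in the style of Hadoop Map-Reduce. Spark implements distributed computation based on the Map-Reduce model and has the advantages of Hadoop Map-Reduce. Unlike Map-Reduce, however, a job's intermediate output and results can be kept in memory, so there is no need to read and write HDFS between steps. Spark is therefore better suited to iterative Map-Reduce-style algorithms such as those used in data mining and machine learning. Its architecture is shown in Figure 6: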

Figure 6. Spark architecture

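Compared with Hadoop, Spark keeps intermediate data in memory, which makes iterative computation more efficient, so Spark suits applications that operate on a particular dataset many times. The more often the data is reused and the larger the volume that has to be read, the greater the benefit; where the data volume is small but the computation is intensive, the benefit is relatively small. Spark is also more general than Hadoop: it offers many kinds of dataset operations (map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, partitionBy and so on), whereas Hadoop provides only Map and Reduce. Spark can read and write HDFS directly and also supports Spark on YARN. Spark can run in the same cluster as Map-Reduce and share storage and compute resources with it. The Shark data warehouse borrows Hive for its implementation and is almost fully compatible with Hive.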
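
The benefit of keeping a reused dataset in memory can be shown with a small, single-process Python toy. `MiniDataset` and its methods are hypothetical names that only imitate the flavor of operators such as map, filter and reduceByKey; this is not the Spark API, and the "storage read" is just a counter.

```python
class MiniDataset:
    """Toy imitation of an RDD-like dataset; not the real Spark API."""

    def __init__(self, loader):
        self._loader = loader        # function that pretends to read from storage
        self._cached = None
        self._cache_enabled = False

    def cache(self):
        """Ask for the records to be kept in memory after the first load."""
        self._cache_enabled = True
        return self

    def _records(self):
        if self._cache_enabled:
            if self._cached is None:
                self._cached = list(self._loader())
            return self._cached
        return list(self._loader())

    def map(self, fn):
        return MiniDataset(lambda: (fn(x) for x in self._records()))

    def filter(self, pred):
        return MiniDataset(lambda: (x for x in self._records() if pred(x)))

    def reduce_by_key(self, fn):
        def load():
            acc = {}
            for key, value in self._records():
                acc[key] = fn(acc[key], value) if key in acc else value
            return acc.items()
        return MiniDataset(load)

    def collect(self):
        return list(self._records())


def make_loader(counter):
    """Build a loader that counts how often the 'storage' is read."""
    def load():
        counter["reads"] += 1
        return [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
    return load


def iterate(dataset, rounds):
    """Run the same aggregation several times, as an iterative algorithm would."""
    result = None
    for _ in range(rounds):
        result = dict(dataset.reduce_by_key(lambda x, y: x + y).collect())
    return result


if __name__ == "__main__":
    uncached_reads, cached_reads = {"reads": 0}, {"reads": 0}
    iterate(MiniDataset(make_loader(uncached_reads)), 3)
    iterate(MiniDataset(make_loader(cached_reads)).cache(), 3)
    print(uncached_reads["reads"], cached_reads["reads"])
```

Without `cache()` every iteration goes back to storage; with it, the data is loaded once and reused, which is why the gain grows with the number of iterations.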

Stinger architecture

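Stinger is a real-time, SQL-like ad hoc query system open sourced by Hortonworks, which claims it can be 100 times faster than Hive. Unlike Hive, Stinger uses Tez. So Hive is SQL on Map-Reduce, and Stinger is Hive on Tez. An important role of Tez is to optimize typical DAG workloads such as Hive and Pig: by reducing data read/write I/O and optimizing the DAG flow, it makes Hive many times faster. Its architecture is shown in Figure 7. Stinger adds an optimization layer, Tez (which is built on YARN), on top of the existing Hive; all queries and aggregations pass through this layer, which reduces unnecessary work and resource overhead. Although Stinger also improves and strengthens Hive itself in many ways, its overall performance still depends on its subsystem Tez. Tez is a DAG computing framework open sourced by Hortonworks. It is sometimes loosely likened to an open source implementation of Google Pregel, although Pregel is aimed at graph processing while Tez is a general DAG execution framework. Tez can be used to design DAG applications in the same way as Map-Reduce, but note that Tez can only run on YARN.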

Figure 7. Stinger architecture
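
One way to picture how a DAG framework reduces read/write I/O is to count how often intermediate results are materialized. The Python sketch below is a deliberately simplified model, assuming that a chain of Map-Reduce jobs writes every stage's output to distributed storage while a single DAG job writes only the final result. `FakeStorage`, the stage functions and the sample data are hypothetical, and real Tez behavior (for example shuffle data on local disk) is not modeled.

```python
class FakeStorage:
    """Counts writes to a pretend distributed file system (a stand-in for HDFS)."""

    def __init__(self):
        self.files = {}
        self.writes = 0

    def write(self, path, rows):
        self.files[path] = list(rows)
        self.writes += 1

    def read(self, path):
        return list(self.files[path])


# Hypothetical input rows, used only for this illustration.
ORDERS = [
    {"city": "Beijing", "amount": 120},
    {"city": "Shanghai", "amount": 80},
    {"city": "Beijing", "amount": 300},
    {"city": "Shenzhen", "amount": 150},
]


def keep_large_orders(rows):
    return [r for r in rows if r["amount"] >= 100]


def to_city_amount(rows):
    return [(r["city"], r["amount"]) for r in rows]


def sum_by_city(pairs):
    totals = {}
    for city, amount in pairs:
        totals[city] = totals.get(city, 0) + amount
    return sorted(totals.items())


STAGES = [keep_large_orders, to_city_amount, sum_by_city]


def run_as_job_chain(stages, rows, storage):
    """Each stage is a separate job whose output is written out and read back."""
    data = rows
    for i, stage in enumerate(stages):
        path = f"/tmp/job_{i}"
        storage.write(path, stage(data))
        data = storage.read(path)
    return data


def run_as_single_dag(stages, rows, storage):
    """All stages run inside one job; only the final result is written."""
    data = rows
    for stage in stages:
        data = stage(data)
    storage.write("/out/result", data)
    return storage.read("/out/result")


if __name__ == "__main__":
    chain, dag = FakeStorage(), FakeStorage()
    print(run_as_job_chain(STAGES, ORDERS, chain), chain.writes)
    print(run_as_single_dag(STAGES, ORDERS, dag), dag.writes)
```

Both runs produce the same answer; the difference is only in how many times data is written to and read from storage along the way.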

Presto architecture

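In November 2013 Facebook open sourced Presto, a distributed SQL query engine designed specifically for high-speed, real-time data analysis. It supports a subset of standard ANSI SQL, including complex queries, aggregations, joins and window functions. Its simplified architecture is shown in Figure 8. The client sends a SQL query to the Presto coordinator, which checks the syntax, analyzes the query and plans its execution. The scheduler assembles the execution pipeline, assigns tasks to the nodes closest to the data, and monitors execution. The client pulls data from the output stage, which in turn pulls data from the lower-level processing stages.

Presto's execution model is fundamentally different from Hive's. Hive translates a query into multiple stages of Map-Reduce tasks that run one after another; each task reads its input from disk and writes intermediate results to disk. Presto does not use Map-Reduce. It uses a custom query execution engine and operators designed to support SQL semantics. In addition to improved scheduling, all data processing is done in memory. The processing stages form a pipeline over the network, which avoids unnecessary disk I/O and extra latency. This pipelined execution model runs multiple stages at the same time and passes data from one stage to the next as soon as it becomes available, which greatly reduces the end-to-end response time of many kinds of queries.

Presto also has a simple storage abstraction layer so that SQL queries can run on top of different data storage systems. Besides Hive/HDFS, the storage connectors currently support HBase, Scribe and custom-developed systems.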

Figure 8. Presto architecture
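
The pipelined execution model, in which each stage hands rows to the next as soon as they are available instead of materializing intermediate results, can be sketched in a single process with Python generators. The operator functions below are hypothetical and only mimic the shape of a query such as `SELECT id, amount FROM t WHERE amount >= 50 LIMIT 3`; they are not Presto's operators, and the network transfer between stages is not modeled.

```python
from itertools import islice


def scan(rows, stats):
    """Source operator: yields rows one at a time and counts how many were read."""
    for row in rows:
        stats["scanned"] += 1
        yield row


def filter_op(rows, predicate):
    """Pass on only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows, columns):
    """Keep only the requested columns of each row."""
    for row in rows:
        yield {c: row[c] for c in columns}


def limit(rows, n):
    """Stop pulling from upstream once n rows have been produced."""
    return islice(rows, n)


def run_query(table, stats):
    """Wire the operators into one pipeline; nothing runs until rows are pulled."""
    pipeline = limit(
        project(
            filter_op(scan(table, stats), lambda r: r["amount"] >= 50),
            ["id", "amount"],
        ),
        3,
    )
    return list(pipeline)


if __name__ == "__main__":
    # Hypothetical table with 1000 rows: id 0..999, amount = id * 10.
    table = [{"id": i, "amount": i * 10, "note": "x"} for i in range(1000)]
    stats = {"scanned": 0}
    print(run_query(table, stats), stats["scanned"])
```

Because rows flow through the operators one at a time, the scan stops as soon as the limit is satisfied and no intermediate result is ever stored in full.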

Performance evaluation summary

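Our evaluation and analysis of Hive, Impala, Shark, Stinger and Presto can be summarized as follows:

1) Columnar storage generally improves query performance noticeably, especially when the large table has many columns. Examples are Stinger (Hive 0.11 with ORCFile) versus Hive, and Impala with Parquet versus text files.

2) Bypassing the MR computing model, and thereby avoiding the persistence of intermediate results and the latency of MR task scheduling, improves performance. For example, Impala, Shark and Presto outperform Hive and Stinger, but the advantage shrinks as data volume grows and queries become more complex.

3) MPP database techniques help join queries. For example, Impala has a clear advantage in two-table and multi-table joins.

4) Systems that make full use of caching have a clear performance advantage when memory is sufficient. For example, Shark and Impala perform markedly better on small data volumes; when memory is insufficient, performance degrades severely, and Shark runs into many problems.

5) Data skew seriously affects the performance of some systems. For example, Hive, Stinger and Shark are sensitive to data skew and prone to it; Impala appears to be little affected.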

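Of the five open source analysis engines Hive, Impala, Shark, Stinger and Presto, Impala is in most cases the most stable overall and the fastest, and it is also relatively easy to install and configure. The others rank Presto, Shark, Stinger and Hive, in that order. When memory is sufficient and the query involves no joins, Shark performs best.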
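
The first finding, that columnar storage helps most when a wide table is queried on a few columns, can be illustrated by counting how many cells a query has to touch under each layout. The Python sketch below uses a hypothetical four-column table and a simple cost model (one unit per cell read); it ignores compression, encoding and I/O block sizes, which matter a great deal in real formats such as ORCFile and Parquet.

```python
# Hypothetical table; the column names and values are illustrative only.
ROWS = [
    {"id": 1, "city": "Beijing", "age": 30, "amount": 10},
    {"id": 2, "city": "Shanghai", "age": 41, "amount": 20},
    {"id": 3, "city": "Beijing", "age": 25, "amount": 5},
    {"id": 4, "city": "Shenzhen", "age": 37, "amount": 5},
]


def to_columns(rows):
    """Convert a row-oriented table into a column-oriented one."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_row_store(rows, column):
    """A row store reads every field of every row to reach a single column."""
    cells_read = 0
    total = 0
    for row in rows:
        cells_read += len(row)
        total += row[column]
    return total, cells_read


def sum_column_store(columns, column):
    """A column store reads only the values of the column that is needed."""
    values = columns[column]
    return sum(values), len(values)


if __name__ == "__main__":
    print(sum_row_store(ROWS, "amount"))
    print(sum_column_store(to_columns(ROWS), "amount"))
```

Under this model the row layout touches rows × columns cells while the column layout touches only one cell per row, so the gap widens as the table gets wider.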

Summary and outlook

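For a big data analysis project, technology is often not the most critical factor; what matters is whose ecosystem is stronger, and a temporary technical lead is not enough to guarantee a project's eventual success. It is still hard to say which of Hive, Impala, Shark, Stinger and Presto will become the de facto standard, but the one thing we can be sure of is that big data analysis will keep spreading as new technologies emerge, which is always good news for users. For example, readers who have followed the development of next-generation Hadoop (YARN) will have noticed that YARN already supports computing paradigms other than Map-Reduce (for example Shark and Impala). In future, Hadoop may therefore serve as a broad, inclusive platform on which all kinds of data processing technologies are offered, some for queries that return in seconds and others for large batch jobs, meeting users' needs in every respect.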

In addition to open source solutions such as Hive, Impala, Shark, Stinger and Presto, traditional vendors such as Oracle and EMC are not sitting still while their market is swallowed by open source software. For example, EMC has launched the HAWQ system, claiming that it is more than ten times faster than Impala, and Amazon's Redshift also offers better performance than Impala. Although open source software is extremely competitive because of its cost advantage, traditional database vendors will keep trying to launch products that are stronger in performance, stability, maintenance services and other respects in order to compete in a differentiated way. At the same time they will take part in open source communities, using open source software to enrich their product lines, improve their competitiveness and meet particular customer needs through higher value-added services. After all, these vendors have accumulated a great deal of technology and experience in traditional fields such as parallel databases. In general, big data analysis technology will become more mature, cheaper and easier to use, and it will accordingly become easier and more convenient for users to mine valuable business information from their own big data.

 
