Comparison of SQL-on-Hadoop Implementation Schemes

Hive

Built on top of Hadoop Distributed File System (HDFS) and MapReduce.

Provides the HiveQL language, allowing users to issue SQL-like queries.

Hive is the long-established Hadoop data warehouse product. It wraps a SQL semantic layer over the MapReduce computing framework to simplify MapReduce development.
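As an illustration, here is a minimal sketch of submitting a HiveQL query through HiveServer2 with the PyHive package; the host, table, and column names are hypothetical.

```python
# Minimal sketch: running a HiveQL aggregation through HiveServer2 via PyHive.
# Host, port, database, table, and column names are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL, but Hive compiles the statement into MapReduce jobs,
# which is why it is dependable for batch work yet too slow for interactive use.
cursor.execute("""
    SELECT page, COUNT(*) AS views
    FROM web_logs
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```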

Advantages:

Simplifies the development of MapReduce programs and offers the best stability of the engines compared here.

Shortcomings:

Queries are slow; Hive suits background batch-processing scenarios, not interactive real-time queries or online analysis.

 

Spark SQL

Another SQL engine for Hadoop, with Spark as the underlying computing framework; Spark itself is implemented in Scala.

Spark SQL is Spark's module for processing structured data.

Structured data can be queried as Spark RDDs (Resilient Distributed Datasets), which provide a restricted form of distributed shared memory.

Since RDDs are read-only, they can only be created, transformed, and evaluated.
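A minimal PySpark sketch of this create-transform-evaluate flow, assuming a working Spark installation; the file path, column names, and view name are hypothetical.

```python
# Minimal sketch of Spark SQL over structured data.
# The HDFS path, columns, and view name below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Read structured data into a DataFrame (backed by RDDs under the hood).
events = spark.read.json("hdfs:///data/events.json")

# RDDs are read-only, so each transformation below creates a new dataset;
# nothing executes until an action (evaluation) is called.
errors = events.filter(events.level == "ERROR").select("service", "message")

# Register the DataFrame as a temporary view so it can be queried with SQL.
errors.createOrReplaceTempView("errors")
top_services = spark.sql("""
    SELECT service, COUNT(*) AS error_count
    FROM errors
    GROUP BY service
    ORDER BY error_count DESC
""")

# The action triggers the actual distributed, in-memory computation.
top_services.show(10)

spark.stop()
```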

Advantages:

Shared-memory (in-memory) computing makes it several times faster than Hive: operating on data in memory reduces disk I/O and greatly improves computing speed.

Shortcomings:

Spark was originally developed for iterative workloads such as training machine-learning algorithms, not for read-only operations and SQL queries, so its SQL performance is comparatively weaker.

A SQL query over a 20 TB dataset takes Spark about 10 minutes, roughly 3 times faster than Hive, but that is still not enough for interactive queries and OLAP applications; it also consumes a lot of memory and is prone to OOM errors.

 

Impala

An MPP (Massively Parallel Processing) query engine running on Hadoop, providing high-performance, low-latency SQL queries over data in a Hadoop cluster, with HDFS as the underlying storage.

Impala can share database tables with Hive and is compatible with HiveQL syntax.

Impala can return query results in seconds.
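A minimal sketch of an interactive Impala query using the impyla package; the host and table names are hypothetical, and 21050 is only the commonly used default port.

```python
# Minimal sketch of a low-latency Impala query via impyla.
# Host, port, and table name are hypothetical.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cursor = conn.cursor()

# Impala executes this with its own MPP daemons rather than MapReduce,
# which is why results for a query like this typically come back in seconds.
cursor.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
""")
for region, total_sales in cursor.fetchall():
    print(region, total_sales)

cursor.close()
conn.close()
```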

Shortcomings:

Poor ease of use: it does not support UPDATE or DELETE; lacks Date, n_numeric, collect_set, XML, and JSON related functions; does not support ROLLUP, CUBE, GROUPING SETS and similar operations; does not support data sampling; and does not support the ORC file format, etc.

 

HAWQ

It is a Hadoop-native MPP (massively parallel processing) SQL analysis engine aimed at analytical applications. Like other relational databases, it accepts SQL and returns result sets.

The HAWQ engine is built on the Greenplum data warehouse code base and its deep data-management expertise, and stores the underlying data in HDFS.

Like Impala, it adopts the MPP architecture, giving users MPP-class analysis and query performance while effectively utilizing HDFS features such as distributed storage, fault tolerance, and rack awareness, balancing low latency with high scalability.

It can co-exist with other traditional SQL-on-Hadoop engines in the same analytical stack.

External data sources such as HDFS, Hive, HBase, and JSON files can be accessed through PXF; it is fully compatible with the SQL standard, and SQL UDFs can be written to perform simple data mining and machine learning.
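A minimal sketch of this PXF and UDF usage through HAWQ's PostgreSQL-compatible interface with psycopg2; the hostnames, paths, and PXF profile name are hypothetical, and the exact pxf:// URL format depends on the HAWQ/PXF version in use.

```python
# Minimal sketch: talking to HAWQ over its PostgreSQL-compatible protocol with psycopg2.
# Hostnames, ports, paths, and the PXF profile are hypothetical; the pxf:// URL
# format varies across HAWQ/PXF versions.
import psycopg2

conn = psycopg2.connect(host="hawq-master.example.com", port=5432, dbname="analytics")
cur = conn.cursor()

# Expose an HDFS text file as an external table through PXF.
cur.execute("""
    CREATE EXTERNAL TABLE ext_orders (order_id int, amount numeric)
    LOCATION ('pxf://namenode.example.com:51200/data/orders.csv?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',')
""")

# A simple SQL UDF, usable directly in queries as mentioned above.
cur.execute("""
    CREATE FUNCTION normalize_amount(numeric) RETURNS numeric
    AS 'SELECT $1 / 100.0'
    LANGUAGE SQL
""")

cur.execute("SELECT order_id, normalize_amount(amount) FROM ext_orders LIMIT 5")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```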

Features:

  1. Fully compatible with the SQL standard
  2. Rich functionality
  3. TPC-DS compliance (query templates)
  4. Partitioned tables
  5. Procedural programming
  6. Native Hadoop file format support
  7. External data integration

Performance:

  1. Cost-based SQL query optimizer
  2. Dynamic pipelining
  3. About 4.55 times faster than Impala
  4. 4 to 50 times faster than Hive
