Hive: HiveQL is a SQL-like language. Under the hood, Hive translates SQL statements directly into MapReduce jobs (SQL ==> MapReduce).
That is why it executes slowly: the defining trait of Hive on MapReduce is that it is slow.
Improvements: Hive on Tez and Hive on Spark both exist to address the slow computation speed of MapReduce.
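To make the SQL ==> MapReduce idea concrete, here is a minimal sketch in plain Python of how a GROUP BY query could be split into map, shuffle, and reduce phases. It is only an illustration: the `orders(city, amount)` table, the sample rows, and all function names are assumptions, and it runs in a single process rather than using Hive or Hadoop.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a table orders(city, amount).
ROWS = [
    ("Beijing", 10),
    ("Shanghai", 5),
    ("Beijing", 7),
    ("Shenzhen", 3),
    ("Shanghai", 1),
]


def map_phase(rows):
    """Emit one (key, value) pair per input row: key = city, value = amount."""
    for city, amount in rows:
        yield city, amount


def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each key's values into (COUNT(*), SUM(amount))."""
    return {key: (len(values), sum(values)) for key, values in groups.items()}


def run_query(rows):
    # Roughly: SELECT city, COUNT(*), SUM(amount) FROM orders GROUP BY city
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    for city, (count, total) in sorted(run_query(ROWS).items()):
        print(city, count, total)
```

In a real MapReduce job each phase runs as separate tasks across the cluster, with intermediate results written to disk between phases, which is a large part of why this execution model is slow.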
Spark: running Hive on Spark led to Shark. In effect, Shark's role was to translate HiveQL into operations on RDDs.
Shark at launch:
Pros: very popular, Spark-based, in-memory columnar storage, compatible with Hive.
Cons: HiveQL parsing, logical execution plan generation, and execution plan optimization all still depended on Hive; Shark only swapped the physical execution plan from MapReduce jobs to Spark jobs.
Termination:
After Shark was discontinued, two branches emerged:
1) Hive on Spark
Maintained by the Hive community, with the source code in Hive. Hive has been developed for many years and is a mature product.
2) Spark SQL
Maintained by the Spark community, with the source code in Spark. Developed in recent years, it removes Shark's dependence on Hive, supports a variety of data sources and optimization techniques, and is much more extensible.
1) Hive:
Open-sourced by Facebook; the original SQL-on-Hadoop solution.
Underlying principles:
a. SQL ==> MapReduce (SQL is converted into MapReduce jobs).
b. It introduced the metastore concept: metadata, i.e. which tables exist in Hive, which columns each table has, and the data type of each column. Tables created in Hive can be accessed from Spark SQL, which makes moving between the two very smooth.
c. Hive's SQL is similar to that of a relational database, and it has the same concepts of database, table, and view.
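The metastore idea in point b can be pictured as a shared catalog that any engine consults to learn a table's schema. The sketch below is a toy model only, assuming an in-memory SQLite database as the catalog; the `table_columns` layout and the function names are hypothetical and do not reflect the real Hive metastore schema or API.

```python
import sqlite3


def create_metastore():
    """Create an in-memory catalog holding one row per table column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table_columns ("
        " db_name TEXT, table_name TEXT, ordinal INTEGER,"
        " column_name TEXT, column_type TEXT)"
    )
    return conn


def register_table(conn, db_name, table_name, columns):
    """Record a table's columns as (name, type) pairs, preserving their order."""
    rows = [
        (db_name, table_name, i, name, ctype)
        for i, (name, ctype) in enumerate(columns)
    ]
    conn.executemany("INSERT INTO table_columns VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def describe_table(conn, db_name, table_name):
    """Return the table's schema as an ordered list of (name, type) tuples."""
    cursor = conn.execute(
        "SELECT column_name, column_type FROM table_columns"
        " WHERE db_name = ? AND table_name = ? ORDER BY ordinal",
        (db_name, table_name),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    metastore = create_metastore()
    # One engine registers the table ...
    register_table(metastore, "default", "orders",
                   [("city", "string"), ("amount", "int")])
    # ... and any other engine sharing the catalog can read the same schema.
    print(describe_table(metastore, "default", "orders"))
```

Because the schema lives in the catalog rather than inside one query engine, a second engine only needs to read the same catalog to query the same tables, which is the property that lets Spark SQL work with tables created in Hive.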
2) Impala:
a. Developed by Cloudera, whose products include CDH (a Hadoop distribution that resolves the version dependencies between Hadoop components) and CM (which provides a web interface for installing Hadoop ecosystem services).
b. SQL is executed by Impala's own daemons rather than by running MapReduce jobs.
c. It also has the metastore concept.
3) Presto
Open-sourced by Facebook; used at Jingdong (JD.com). Queries are written in SQL.
4) Drill (popular in recent years)
SQL
Data sources the framework can operate on: HDFS, Hive, RDBMS, JSON, HBase, MongoDB, S3, or an external relational database
5) Spark SQL (popular in recent years)
SQL
DataFrame/Dataset API
metastore
Data sources the framework can access: HDFS, Hive, RDBMS, JSON, HBase, MongoDB, S3, or an external relational database
Spark SQL detailed description
Spark SQL is Apache Spark's module for working with structured data. (In other words, Spark SQL is a Spark module that handles structured data such as TXT and JSON.)
Spark SQL offers more than the ability to access and operate on data with SQL: it also provides a rich set of other features, such as external data sources and query optimization.
Spark SQL provides a SQL API as well as the DataFrame and Dataset APIs.