Spark SQL Overview and History

Hive: HiveQL is a SQL-like language. Under the hood, Hive translates SQL statements directly into MapReduce jobs (SQL ==> MapReduce).

Because every query runs as MapReduce jobs, execution is slow. The defining trait of Hive on MapReduce: slow.
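To make the SQL ==> MapReduce idea concrete, here is a minimal plain-Python sketch of how a query such as `SELECT dept, COUNT(*) FROM emp GROUP BY dept` could be split into map, shuffle, and reduce phases. The `emp` table, its rows, and all function names are hypothetical. This only imitates the general shape of the translation; it is not what Hive's compiler actually emits.

```python
from collections import defaultdict

# Hypothetical rows of an "emp" table: (name, dept)
EMP_ROWS = [
    ("alice", "eng"),
    ("bob", "eng"),
    ("carol", "sales"),
    ("dave", "eng"),
    ("erin", "sales"),
]


def map_phase(rows):
    """Emit one (dept, 1) pair per row, like a mapper for GROUP BY dept."""
    for _name, dept in rows:
        yield dept, 1


def shuffle_phase(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key, like a reducer computing COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


def run_group_by_count(rows):
    """Chain the three phases the way a single MapReduce job would."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_group_by_count(EMP_ROWS))
```

In a real cluster each phase reads and writes data on disk, and a query with several stages becomes a chain of such jobs, which is a large part of why this approach is slow.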

Improvements: Hive on Tez and Hive on Spark both aim to fix the slow computation speed of MapReduce.

Spark: Hive on Spark ==> Shark. Shark's job is to translate HiveQL into operations on RDDs.

Shark at launch:

Pros: very popular, Spark-based, in-memory columnar storage, compatible with Hive.

Cons: HiveQL parsing, logical plan generation, and plan optimization all depended on Hive. Shark only replaced the physical execution plan, turning MapReduce jobs into Spark jobs.

Shark's termination:

After Shark was discontinued, its work continued in two branches:

1) Hive on Spark

Developed in the Hive community; the source code lives in Hive. Hive has been under development for many years and is a mature product.

2) Spark SQL

Developed in the Spark community; the source code lives in Spark. It is more recent, removes Shark's dependence on Hive, supports a variety of data sources and optimization techniques, and is much more extensible.



1) Hive:

Open-sourced by Facebook; the original SQL-on-Hadoop solution.

Underlying principles:

a. SQL ==> MapReduce (SQL is converted into MapReduce jobs).

b. It introduced the metastore concept: metadata describing which tables exist in Hive, which columns each table has, and the data type of each column. Tables created in Hive are accessible from Spark SQL, which makes the transition very smooth.

c. Hive's SQL is similar to that of a relational database; it also has the concepts of databases, tables, and views.

2) Impala:

a. Developed by Cloudera, whose products include CDH (a Hadoop distribution that neatly resolves Hadoop version dependencies) and CM (Cloudera Manager, which provides a web interface for installing Hadoop ecosystem services).

b. SQL is executed by Impala's own daemons rather than by running MapReduce.

c. It also has the metastore concept.

3) Presto

Open-sourced by Facebook and used at JD.com (Jingdong); SQL.

4) Drill (popular in recent years)

SQL

Data sources the framework can access: HDFS, Hive, RDBMS, JSON, HBase, MongoDB, S3, or an external relational database

5) Spark SQL (popular in recent years)

SQL

DataFrame/Dataset API

metastore

Data sources the framework can access: HDFS, Hive, RDBMS, JSON, HBase, MongoDB, S3, or an external relational database
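The metastore concept that appears above for Hive, Impala, and Spark SQL can be pictured as a shared catalog that an engine consults before it reads any data. The sketch below is a toy in plain Python. The table name, columns, storage location, and helper functions are all hypothetical and do not mirror the real Hive metastore API.

```python
# Toy catalog: table name -> schema and storage location (all values hypothetical).
METASTORE = {
    "default.emp": {
        "columns": [("name", "string"), ("dept", "string"), ("salary", "double")],
        "location": "hdfs:///warehouse/emp",
    },
}


def describe_table(metastore, table):
    """Return 'column type' lines for a table, similar in spirit to DESCRIBE."""
    entry = metastore[table]
    return [f"{col} {typ}" for col, typ in entry["columns"]]


def resolve_columns(metastore, table, wanted):
    """Check that the requested columns exist and return their types."""
    schema = dict(metastore[table]["columns"])
    missing = [col for col in wanted if col not in schema]
    if missing:
        raise KeyError(f"unknown columns in {table}: {missing}")
    return {col: schema[col] for col in wanted}


if __name__ == "__main__":
    print(describe_table(METASTORE, "default.emp"))
    print(resolve_columns(METASTORE, "default.emp", ["name", "salary"]))
```

Because the schema lives in the catalog rather than inside any one engine, a second engine only needs to read the same entries to see tables that the first one created.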

Spark SQL in Detail


An active community and stable releases


You can use SQL, Hive SQL (HiveQL), UDFs, UDAFs, and SerDes


Third-party tools can access data through JDBC and ODBC


Supports development in multiple languages

"Spark SQL is Apache Spark's module for working with structured data." (Spark SQL is a Spark module that handles structured data such as TXT, JSON, etc.)

Spark SQL is not limited to SQL access and operations; it offers much more, including external data sources and query optimization.

Spark SQL provides a SQL API as well as the DataFrame and Dataset APIs.


DataFrame operations execute faster than the equivalent RDD code


The Catalyst optimization process underlying Spark SQL


DataFrames and Spark SQL are optimized on the same principle
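One way to picture why the two front ends share an optimizer: both can be reduced to the same logical plan tree, and rewrite rules then operate on that tree regardless of where it came from. The toy below, in plain Python, applies a single predicate-pushdown rule, which is one kind of rewrite that optimizers such as Catalyst perform. The plan classes, the `ToyFrame` builder, and the `people` table are hypothetical illustrations, not Spark's real classes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object


@dataclass(frozen=True)
class Filter:
    column: str   # column the predicate reads
    op: str
    value: object
    child: object


def push_filter_below_project(plan):
    """Rewrite Filter(Project(x)) into Project(Filter(x)) when the
    projection keeps the column that the filter reads."""
    if isinstance(plan, Filter):
        child = push_filter_below_project(plan.child)
        if isinstance(child, Project) and plan.column in child.columns:
            pushed = push_filter_below_project(
                Filter(plan.column, plan.op, plan.value, child.child)
            )
            return Project(child.columns, pushed)
        return Filter(plan.column, plan.op, plan.value, child)
    if isinstance(plan, Project):
        return Project(plan.columns, push_filter_below_project(plan.child))
    return plan


class ToyFrame:
    """Hypothetical DataFrame-style builder that only records a plan."""

    def __init__(self, plan):
        self.plan = plan

    def select(self, *columns):
        return ToyFrame(Project(tuple(columns), self.plan))

    def where(self, column, op, value):
        return ToyFrame(Filter(column, op, value, self.plan))


# Plan a SQL front end might build for:
#   SELECT * FROM (SELECT name, age FROM people) WHERE age > 30
SQL_PLAN = Filter("age", ">", 30, Project(("name", "age"), Scan("people")))

# The same query written against the toy DataFrame-style builder.
FRAME_PLAN = ToyFrame(Scan("people")).select("name", "age").where("age", ">", 30).plan

if __name__ == "__main__":
    print(push_filter_below_project(SQL_PLAN))
    print(push_filter_below_project(FRAME_PLAN) == push_filter_below_project(SQL_PLAN))
```

Both inputs produce an identical tree before optimization, so the rule yields an identical result for each, with the filter sitting directly above the scan.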



Origin blog.csdn.net/weixin_34088598/article/details/90970905