Spark SQL and Hive On Spark

 

One: The difference between Spark SQL and Hive On Spark


Spark SQL is a subproject of Spark.
It was developed to execute queries against a variety of data sources, including Hive, JSON, Parquet, JDBC, RDDs, and so on, using Spark itself as the underlying query engine.

Hive On Spark is a subproject of Hive.
Instead of relying on MapReduce as its only execution engine, it lets Spark serve as the underlying query engine.


Two: The basic principles of Hive


HiveQL statement =>
parsing => AST =>
logical plan generation => Operator Tree =>
logical plan optimization => Optimized Operator Tree =>
physical plan generation => Task Tree =>
physical plan optimization => Optimized Task Tree =>
execution of the optimized Task Tree
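The compilation pipeline above can be sketched in toy form as follows. All function names and plan contents here are illustrative placeholders, not Hive's actual internal API:

```python
# A toy sketch of Hive's HiveQL compilation pipeline.
# Every name below is hypothetical; Hive's real compiler classes differ.

def parse(hql: str) -> dict:
    """Turn a HiveQL string into a (toy) AST."""
    return {"type": "query", "text": hql}

def to_operator_tree(ast: dict) -> list:
    """Generate the logical plan (Operator Tree) from the AST."""
    return ["TableScan", "Filter", "GroupBy", "Select"]

def optimize_operators(ops: list) -> list:
    """Logical optimization, e.g. predicate pushdown (a no-op in this toy)."""
    return ops

def to_task_tree(ops: list) -> list:
    """Generate the physical plan (Task Tree) from the operator tree."""
    return [("MapTask", ops[:2]), ("ReduceTask", ops[2:])]

def optimize_tasks(tasks: list) -> list:
    """Physical optimization, e.g. merging adjacent tasks (no-op here)."""
    return tasks

def compile_hql(hql: str) -> list:
    """Chain the stages: parse -> logical plan -> optimize -> physical plan -> optimize."""
    return optimize_tasks(to_task_tree(optimize_operators(to_operator_tree(parse(hql)))))

plan = compile_hql("SELECT dept, COUNT(*) FROM emp GROUP BY dept")
print(plan)
```

The point is only the shape of the flow: each stage consumes the previous stage's representation, and the final Task Tree is what gets submitted for execution.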

Three: How Hive On Spark computes


1. Hive tables are operated on as Spark RDDs

2. Hive primitives are used
For operations on the RDD, such as groupByKey, sortByKey and so on, Hive's own primitives are applied; Spark's transformation operators and primitives are not used.
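The semantics of such key-based primitives can be illustrated in plain Python. This is a toy model of what groupByKey- and sortByKey-style operations do to key-value records, not Spark's or Hive's actual API:

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect all values sharing a key into one group (groupByKey-style)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def sort_by_key(pairs):
    """Order key-value records by key (sortByKey-style)."""
    return sorted(pairs, key=lambda kv: kv[0])

records = [("b", 2), ("a", 1), ("a", 3)]
print(group_by_key(records))  # {'b': [2], 'a': [1, 3]}
print(sort_by_key(records))   # [('a', 1), ('a', 3), ('b', 2)]
```

In a real cluster these operations imply a shuffle of records across nodes; the in-memory version only shows the logical result.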

3. A new physical plan generation mechanism
SparkCompiler converts the logical plan, namely the Operator Tree, into a Task Tree. The Spark Task is then submitted to Spark for execution.

4. SparkContext life cycle
Hive On Spark creates a SparkContext for each user session, i.e. for each execution of a SQL statement.

5. Local and remote operation modes
Local: spark.master is set to local, e.g. SET spark.master=local;
the SparkContext runs in the client JVM.

Remote: spark.master is set to the address of a Master node, which means remote mode;
the SparkContext starts in a remote JVM.

The client communicates with the remote SparkContext JVM via RPC.

Four: Optimization points of Hive On Spark


1. Map Join
By default, Spark SQL supports map join by using a broadcast mechanism: the small table is broadcast to every node where the join is executed.

A mechanism similar to MapReduce's Distributed Cache is also being considered: raising the HDFS replication factor so that the data has a copy on every compute node and can be read locally.
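The broadcast (map-side) hash join idea can be sketched in plain Python; this is a toy model, not Spark's implementation. The small table is "broadcast" as an in-memory hash map, so each record of the large table is joined locally without a shuffle:

```python
# Small dimension table, broadcast to every worker as a hash map.
small_table = {1: "engineering", 2: "sales"}             # dept_id -> dept_name
# Large fact table, processed record by record on each worker.
large_table = [("alice", 1), ("bob", 2), ("carol", 1)]   # (employee, dept_id)

def broadcast_hash_join(large, broadcast_map):
    """Join each large-table row against the broadcast hash map locally."""
    joined = []
    for name, dept_id in large:
        dept = broadcast_map.get(dept_id)
        if dept is not None:          # inner join: drop non-matching rows
            joined.append((name, dept))
    return joined

rows = broadcast_hash_join(large_table, small_table)
print(rows)  # [('alice', 'engineering'), ('bob', 'sales'), ('carol', 'engineering')]
```

This only pays off when one side of the join is small enough to fit in each worker's memory, which is exactly the condition for a map join.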

2. Cache Table
For scenarios where multiple operations must be performed on the same table, Hive On Spark applies a Cache Table optimization: before the repeated operations, the table is cached in memory, in order to improve performance.
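The benefit of caching a table that several queries reuse can be shown with a toy Python sketch (this is an illustration of the idea, not Hive's mechanism):

```python
scan_count = 0

def scan_table():
    """Pretend this is an expensive disk/HDFS table scan."""
    global scan_count
    scan_count += 1
    return [("a", 1), ("b", 2), ("c", 3)]

_cache = None

def cached_table():
    """Materialize the scan once in memory; later queries reuse it."""
    global _cache
    if _cache is None:
        _cache = scan_table()
    return _cache

total = sum(v for _, v in cached_table())    # first query: scans the table
maximum = max(v for _, v in cached_table())  # second query: served from memory
print(scan_count, total, maximum)  # 1 6 3
```

Without the cache, each query would re-run the scan; with it, the expensive read happens exactly once.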

Five: Understanding RPC


RPC (Remote Procedure Call) is a technique for requesting a service from a program on a remote computer over a network, without needing to understand the underlying network protocols.

Steps:
At runtime, an RPC call from the client to the server goes roughly through the following ten steps:
1. The client handle is invoked and the call parameters are passed in
2. The local system kernel sends the network message
3. The message is transmitted to the remote host
4. The server handle receives the message and extracts the parameters
5. The remote procedure is executed
6. The execution result is returned to the server handle
7. The server handle returns the result by calling the remote system kernel
8. The message is transmitted back to the local host
9. The client receives the message via its system kernel
10. The client handle receives the returned data
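A complete RPC round trip can be demonstrated with Python's standard library. This illustrates RPC in general (client handle, network transfer, remote execution, returned result), not Hive's actual RPC layer:

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def add(a, b):
    # The remote procedure executed on the server (steps 4-6).
    return a + b

# Port 0 asks the OS for a free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(add, "add")
port = server.server_address[1]

# Serve exactly one request in a background thread (the server side).
t = threading.Thread(target=server.handle_request)
t.start()

# The client handle marshals the call and its parameters, sends them over
# the network, and waits for the returned message (steps 1-3 and 7-10).
client = ServerProxy(f"http://127.0.0.1:{port}")
result = client.add(2, 3)
t.join()
server.server_close()
print(result)  # 5
```

To the caller, `client.add(2, 3)` looks like an ordinary local function call; the handle hides the message marshalling and network transfer described in the ten steps above.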


Six: The remaining questions


What exactly is Hive?
Is it based on MapReduce by default at the bottom layer?
Later, new SQL query engines appeared one after another, including Spark SQL, Hive On Tez, Hive On Spark, and so on.
The difference between Hive and HBase:
What does "SQL query engine" refer to?
What engines are there?
What are the detailed principles of SQL statement execution?
Environment setup!!
What is Hive's architecture? How does it differ from a traditional relational database?


 


Origin blog.csdn.net/weixin_39966065/article/details/93538139