Spark: Hive on Spark

Overall design

The overall design idea of Hive on Spark is to reuse Hive's logical-layer functionality as much as possible; from physical plan generation onward, a complete set of Spark-specific implementations is provided, such as SparkCompiler and SparkTask, so that Hive queries can be executed as Spark tasks. Here are a few key design principles (a conceptual sketch of this split follows the list).

Minimize modifications to Hive's existing code. This is the biggest difference from the earlier Shark design: Shark's changes to Hive were too extensive to be accepted by the Hive community. Hive on Spark modifies Hive's code as little as possible so as not to affect Hive's current support for MapReduce and Tez, and it guarantees that the functionality and performance of the existing MapReduce and Tez execution paths are not affected.
Users who choose Spark as the execution engine should automatically get Hive's existing and future features.
Keep maintenance costs as low as possible and keep the dependency on Spark loosely coupled.
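
To make the split described above more concrete, here is a rough conceptual sketch in Scala. It is not Hive's actual code: SparkCompiler and SparkTask are the real component names mentioned above, but the simplified types, object names, and method signatures below are invented purely for illustration.

// Conceptual sketch only: simplified stand-ins, not Hive's real classes or signatures.
case class LogicalPlan(query: String)       // produced by Hive's shared front end
case class SparkWork(description: String)   // Spark-specific physical plan

object HiveFrontEnd {
  // Parsing, semantic analysis, and logical optimization are reused from Hive unchanged.
  def analyzeAndOptimize(sql: String): LogicalPlan = LogicalPlan(sql)
}

object SparkCompilerSketch {
  // From the physical-plan stage onward everything is Spark-specific:
  // the compiler translates the optimized logical plan into Spark work ...
  def compile(plan: LogicalPlan): SparkWork = SparkWork(s"spark work for: ${plan.query}")
}

object SparkTaskSketch {
  // ... and a task submits that work to the Spark cluster for execution.
  def execute(work: SparkWork): Unit = println(s"submitting ${work.description}")
}

object HiveOnSparkFlow {
  def run(sql: String): Unit = {
    val plan = HiveFrontEnd.analyzeAndOptimize(sql)  // shared with the MapReduce/Tez paths
    val work = SparkCompilerSketch.compile(plan)     // Spark-specific from here on
    SparkTaskSketch.execute(work)
  }
}

The point of this structure is that everything before compile() is shared with the MapReduce and Tez paths, which is what keeps the modifications to Hive small and lets Spark users inherit new Hive features automatically.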

 

 

Using Hive primitives

This mainly refers to using Hive's operators to process data. Spark provides a series of transformations for RDDs, some of which are SQL-oriented, such as groupByKey and join. But using these transformations directly (as Shark did) would mean re-implementing functionality that Hive already has, and whenever Hive added a new feature, Hive on Spark would have to be modified accordingly. In view of this, we choose to wrap Hive's operators as functions and then apply them to RDDs. This way, we only need to rely on a small number of generic RDD transformations, and the main computation logic is still provided by Hive.
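
To illustrate the wrapping idea, here is a minimal sketch in Scala. It is not the actual Hive on Spark code (which is written in Java and has its own function classes); the HiveOperatorPipeline and HiveOperatorWrapper names, and the row representation, are assumptions made for this example.

import org.apache.spark.rdd.RDD

// Hypothetical stand-in for a chain of Hive operators (e.g. TableScan -> Filter -> Select).
// In Hive on Spark the real operators come from Hive's own execution engine; this trait
// only illustrates the shape of the contract.
trait HiveOperatorPipeline extends Serializable {
  def process(row: Array[AnyRef]): Iterator[Array[AnyRef]]
}

// Wrap the operator pipeline as an ordinary function over a partition of rows, so the
// actual processing (filtering, projection, aggregation) stays inside Hive's operators
// and Spark only supplies distributed execution through a generic transformation.
class HiveOperatorWrapper(pipeline: HiveOperatorPipeline)
    extends (Iterator[Array[AnyRef]] => Iterator[Array[AnyRef]]) with Serializable {
  override def apply(rows: Iterator[Array[AnyRef]]): Iterator[Array[AnyRef]] =
    rows.flatMap(pipeline.process)
}

object HivePrimitiveExample {
  // Usage: the map-side work of a query needs only one generic RDD transformation.
  def runMapSide(input: RDD[Array[AnyRef]], pipeline: HiveOperatorPipeline): RDD[Array[AnyRef]] =
    input.mapPartitions(new HiveOperatorWrapper(pipeline))
}

The effect is that Spark sees only a generic mapPartitions call, while all of the query semantics live inside the wrapped Hive operators, so new Hive features are picked up without changing the Spark side.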

 

Reference:

Intel Li Rui, Analysis of Hive on Spark
http://www.aboutyun.com/thread-12334-1-1.html
