The Working Mechanism of Catalyst in Spark SQL

 
Q: What is the working mechanism of Catalyst in Spark SQL?
A: Whether the job is triggered by SQL, Hive SQL, or an action on a DataFrame or Dataset, the query is first parsed into an unresolved logical plan. The analyzer then uses catalog metadata to resolve it into a resolved logical plan, and the optimizer rewrites that into an optimized logical plan. From the optimized logical plan, Spark generates one or more physical plans, evaluates each of them with a cost model to estimate which will perform best, and selects the best-performing physical plan. Code generation then turns the selected physical plan into an RDD chain, which is executed to produce the result.
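A minimal sketch (assuming a local SparkSession and a made-up "people" temp view, not taken from the original post) showing how each stage of this pipeline can be inspected through queryExecution:

import org.apache.spark.sql.SparkSession

object CatalystPlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CatalystPlanDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Register a tiny DataFrame as a temp view so the analyzer has metadata to resolve against.
    Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age").createOrReplaceTempView("people")

    val df = spark.sql("SELECT name FROM people WHERE age > 26")

    // One field per stage of the pipeline described above.
    println(df.queryExecution.logical)        // unresolved logical plan (parsed)
    println(df.queryExecution.analyzed)       // resolved logical plan (metadata applied)
    println(df.queryExecution.optimizedPlan)  // optimized logical plan (Catalyst rules applied)
    println(df.queryExecution.executedPlan)   // selected physical plan (ready for code generation)

    spark.stop()
  }
}

Calling df.explain(true) prints the same four plans in one go.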

An RDD is an immutable, distributed dataset that is partitioned across the cluster, lazily evaluated, and type-safe.
The RDD is the foundation of Spark; both Dataset and DataFrame ultimately call the RDD API to do their work.
A DataFrame is a Dataset whose element type is Row. Like an RDD, it is an immutable, distributed dataset partitioned across the cluster and lazily evaluated, but it is not type-safe and does not provide the RDD-like functional programming interface; in return, a DataFrame performs much better than an RDD.
A Dataset is strongly typed and adds functional-style operations on top of the DataFrame; put bluntly, Dataset = RDD + DataFrame.
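A minimal sketch (assuming a hypothetical Person case class and a local SparkSession, not from the original post) contrasting the three abstractions: the untyped, Row-based DataFrame, the strongly typed Dataset, and the RDD they both ultimately execute as:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Int)

object AbstractionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AbstractionDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()

    // DataFrame = Dataset[Row]: columns are referenced by name and checked only at runtime,
    // so a typo in the column name fails when the job runs, not at compile time.
    val df: DataFrame = ds.toDF()
    df.filter("age > 26").show()

    // Dataset keeps the compile-time type: p.age is checked by the Scala compiler.
    ds.filter(p => p.age > 26).show()

    // Both ultimately execute as an RDD chain.
    println(ds.rdd.toDebugString)

    spark.stop()
  }
}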
 
This is because the Spark team uses the schema information of a DataFrame or Dataset to apply a great deal of performance optimization to the DataFrame/Dataset API, for example:
1. When a DataFrame or Dataset is cached, the data is stored in memory column by column as primitive types.
2. The Tungsten project: first, an explicit memory manager was introduced that lets Spark operate directly on binary data instead of Java objects, which reduces the overhead of Java objects and the inefficiency of GC; second, more cache-friendly algorithms and data structures were designed, so that a Spark application spends less time waiting for the CPU to read data from memory and more time doing useful work; third, code generation removes the boxing and unboxing of primitive types and, more importantly, avoids expensive polymorphic function dispatch.
3. The Catalyst Optimizer: because Spark RDDs are lazily evaluated, there is room to apply a lot of optimization to the RDD chain before the job is triggered, and the Catalyst Optimizer is exactly that convenient means of optimizing the chain (a short sketch follows this list).
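A minimal sketch (local SparkSession and made-up sample data, assumed for illustration) showing points 1 and 3 from the list above: columnar in-memory caching, and a Catalyst rewrite (constant folding) made visible with explain(true):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object OptimizerDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("OptimizerDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

    // Point 1: cache() stores the data in memory column by column; the physical plan of
    // later queries over df shows an InMemoryTableScan over an InMemoryRelation.
    df.cache()
    df.count()

    // Point 3: Catalyst's ConstantFolding rule collapses lit(1) + lit(1) into the constant 2
    // in the optimized logical plan; explain(true) prints every stage so the rewrite is visible.
    df.filter($"id" > (lit(1) + lit(1))).explain(true)

    spark.stop()
  }
}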
 
 

Origin www.cnblogs.com/tesla-turing/p/11959282.html