(D) SparkSQL study notes

Spark on Hive

Development environment configuration

  1. Copy hive-site.xml from $HIVE_HOME/conf into $SPARK_HOME/conf;
  2. Copy hdfs-site.xml and core-site.xml from $HADOOP_HOME/etc/hadoop into $SPARK_HOME/conf;
  3. On the node where the files were copied into $SPARK_HOME/conf, start spark-sql in local mode;
  4. If the Hive metastore is a MySQL database, put the MySQL driver jar into the $SPARK_HOME/jars directory.

Development environment (IDE project): create a conf folder in the project and put the three configuration files mentioned above (hive-site.xml, hdfs-site.xml, core-site.xml) into it. If the Hive metastore is a MySQL database, the MySQL driver must also be on the project classpath.

import org.apache.spark.sql.SparkSession

val spark = SparkSession
              .builder()
              .master("local[*]")
              .appName("Spark Hive Example")
              .enableHiveSupport() // enable Hive support
              .getOrCreate()

<!-- If the version is higher than 1.2.1, set this property in hive-site.xml to avoid an error: -->
<property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
</property>

SparkSQL execution

  1. Write Dataset API or SQL code;
  2. If the code is valid, Spark converts it into a logical plan;
  3. Spark transforms the logical plan into a physical plan, optimizing it along the way (Catalyst optimizer);
  4. Spark executes the physical plan (as RDD operations).
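
The steps above can be observed directly on a DataFrame. The following is a minimal sketch, assuming a local Spark session and a small made-up in-memory dataset (the `people` data and the `ExplainStages` object are illustrative names, not from the original notes). Calling `explain(true)` prints the parsed, analyzed, and optimized logical plans followed by the physical plan.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExplainStages {
  // Build a small query over hypothetical in-memory data.
  def buildQuery(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val people = Seq(("Ann", 31), ("Bob", 17), ("Cid", 45)).toDF("name", "age")
    people.filter($"age" > 18).select("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Explain Stages Sketch")
      .getOrCreate()

    val adults = buildQuery(spark)
    // extended = true prints the logical plans as well as the physical plan
    adults.explain(true)
    // An action such as show() triggers execution of the physical plan
    adults.show()

    spark.stop()
  }
}
```

Nothing runs on the data until the action (`show()`) is called; `explain` only builds and prints the plans.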

Logical plan

The logical plan does not involve executors or the driver; its only purpose is to turn the user's code into the most optimized version. User code is first converted into an unresolved logical plan. Spark then resolves it against the catalog (a repository of all table and DataFrame information) to produce a resolved logical plan. The resolved plan is handed to the Catalyst optimizer, which is a collection of optimization rules such as predicate pushdown and projection pruning.
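
The unresolved, resolved, and optimized plans can be inspected through a query's `queryExecution`. This is a sketch under assumptions: the `people` temp view and the `LogicalPlanStages` object are hypothetical names introduced for illustration. A SQL string is used here because the plan produced by the parser still contains an unresolved table reference until the analyzer looks it up in the catalog.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution

object LogicalPlanStages {
  // Register hypothetical data as a temp view and return the query's QueryExecution.
  def plansFor(spark: SparkSession): QueryExecution = {
    import spark.implicits._
    Seq(("Ann", 31), ("Bob", 17)).toDF("name", "age").createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 18").queryExecution
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Logical Plan Sketch")
      .getOrCreate()
    val qe = plansFor(spark)

    println(qe.logical)       // unresolved plan, straight from the parser
    println(qe.analyzed)      // resolved against the catalog
    println(qe.optimizedPlan) // after Catalyst's optimization rules

    println(s"parser output resolved?   ${qe.logical.resolved}")
    println(s"analyzer output resolved? ${qe.analyzed.resolved}")

    spark.stop()
  }
}
```

The exact shape of the optimized plan depends on the Spark version and the data source, so the sketch prints it rather than assuming a particular result.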

Physical plan

Starting from the optimized logical plan, Spark generates several candidate physical execution strategies (plans A, B, C) and compares them with a cost model to select the best physical plan. The result is a series of RDDs and transformations.
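
The selected physical plan is also reachable from `queryExecution`. The sketch below assumes a made-up `orders` dataset and a hypothetical `PhysicalPlanSketch` object. It prints the plan chosen by the planner (`sparkPlan`) and the plan that will actually run after Spark's preparation rules have been applied (`executedPlan`).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution

object PhysicalPlanSketch {
  // Aggregate hypothetical order data and return the query's QueryExecution.
  def plansFor(spark: SparkSession): QueryExecution = {
    import spark.implicits._
    val orders = Seq((1, "Ann", 20.0), (2, "Bob", 35.5), (3, "Ann", 12.0))
      .toDF("id", "customer", "amount")
    orders.groupBy("customer").sum("amount").queryExecution
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Physical Plan Sketch")
      .getOrCreate()
    val qe = plansFor(spark)

    println(qe.sparkPlan)    // physical plan selected by the planner
    println(qe.executedPlan) // plan prepared for execution (e.g. shuffle exchanges added)

    spark.stop()
  }
}
```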

Execution

After a physical plan has been selected, Spark runs all of this code over RDDs. Tungsten applies further optimizations at runtime by generating native Java bytecode, the resulting stages are executed, and the result is finally returned to the user.
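
To see what is actually executed, one option is to look at the RDD lineage behind a query and at the Java source that whole-stage code generation produces. The following is a sketch, assuming a local session and a hypothetical `ExecutionSketch` object; `debugCodegen()` comes from Spark's `org.apache.spark.sql.execution.debug` package, and its output format varies between Spark versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.debug._

object ExecutionSketch {
  // Keep the odd ids from 1 to 5 and multiply each by 10.
  def buildQuery(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.range(1, 6).filter($"id" % 2 === 1).selectExpr("id * 10 AS scaled")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Execution Sketch")
      .getOrCreate()
    val df = buildQuery(spark)

    // RDD lineage that the physical plan compiles down to
    println(df.queryExecution.toRdd.toDebugString)
    // Java source generated for the whole-stage codegen subtrees
    df.debugCodegen()
    // The action runs the stages and returns the rows to the driver
    println(df.collect().map(_.getLong(0)).toList)

    spark.stop()
  }
}
```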

Origin blog.csdn.net/dec_sun/article/details/89819558