Spark SQL: an analysis of its working principles and performance optimization

Table of Contents

 

One: Spark SQL execution flow

Two: Spark SQL optimization

Three: Problems to be solved


One: Spark SQL execution flow

1. SqlParser (parsing the SQL statement)
MySQL, Oracle, Hive, and similar systems all begin by parsing a SQL statement into an execution plan; Spark SQL's parser likewise turns the statement into an (unresolved) logical plan.

2. Analyzer
The analyzer resolves the parsed logical plan, for example:
from table students => filter => select name
organized as a tree-like structure.

3. Optimizer
Traditional databases such as Oracle usually generate several candidate execution plans, and the optimizer chooses the best one.
Example:
SELECT name FROM (SELECT ... FROM ...) WHERE ...
If the WHERE condition can be pushed down into the subquery with the same effect, less data is read and efficiency improves => this is the kind of rewrite the optimizer performs (predicate pushdown).

4. SparkPlan (physical plan)
Physical planning: which files to read what data from, which files to join against, and so on. (All four stages can be inspected; see the sketch after the diagram below.)

(Diagram of the Spark SQL execution flow: SqlParser => Analyzer => Optimizer => SparkPlan.)

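To make the four stages concrete, here is a minimal sketch using the Spark 1.x-era SQLContext API that this post refers to; the students table, its columns, and the query are hypothetical examples, not part of the original post.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("PlanInspection").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical data; in practice it would come from Hive, Parquet, JSON, etc.
val students = sc.parallelize(Seq(("Alice", 20), ("Bob", 17))).toDF("name", "age")
students.registerTempTable("students")

val query = sqlContext.sql("SELECT name FROM students WHERE age > 18")

// explain(true) prints all four stages: the parsed (unresolved) logical plan,
// the analyzed logical plan, the optimized logical plan, and the physical plan.
query.explain(true)

// The same plans are also available programmatically:
println(query.queryExecution.logical)        // SqlParser output (unresolved logical plan)
println(query.queryExecution.analyzed)       // after the Analyzer resolves tables and columns
println(query.queryExecution.optimizedPlan)  // after the Optimizer (e.g. predicate pushdown)
println(query.queryExecution.sparkPlan)      // the physical SparkPlan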

Two: Spark SQL optimization


1. Set the degree of parallelism for the shuffle stage: spark.sql.shuffle.partitions, via SQLContext.setConf() (see the configuration sketch after this list).
2. When building the Hive data warehouse, choose data types sensibly: if a column can be INT, do not declare it BIGINT. This avoids unnecessary memory overhead caused by oversized data types.
3. When writing SQL, name the columns you need explicitly, e.g. select name from students; do not write select *.
4. Process query results in parallel: if a Spark SQL query returns a large result set (for example, more than 1000 rows), do not collect() everything to the Driver at once for further processing. Use the foreach() operator to process the results in parallel on the executors.
5. Cache tables: a table that may be used by more than one SQL statement can be cached with SQLContext.cacheTable(tableName) or DataFrame.cache() (see the caching sketch after this list). Spark SQL caches the table in an in-memory columnar storage format, so it scans only the columns that are actually used and automatically applies compression, minimizing memory usage and GC overhead. SQLContext.uncacheTable(tableName) removes the table from the cache. With SQLContext.setConf(), the spark.sql.inMemoryColumnarStorage.batchSize parameter (default 10000) configures the size of each columnar storage batch.
6. Broadcast join tables: spark.sql.autoBroadcastJoinThreshold, default 10485760 (10 MB). If there is enough memory, this threshold can be increased; when a table in a join occupies less than the threshold, it is broadcast to all executors, which improves join performance.
7. Tungsten: spark.sql.tungsten.enabled, default true, enables automatic memory management.
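The configuration-related tips (1, 5, 6, 7) can all be applied through SQLContext.setConf(). The sketch below shows the calls; the specific values (200 partitions, 50 MB broadcast threshold) are illustrative choices, not recommendations from the original post.

// Assuming sqlContext is an existing org.apache.spark.sql.SQLContext (Spark 1.x API).

// Tip 1: parallelism of shuffles triggered by SQL joins and aggregations (example value).
sqlContext.setConf("spark.sql.shuffle.partitions", "200")

// Tip 5: rows per batch in the in-memory columnar cache (default 10000).
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

// Tip 6: raise the broadcast-join threshold from the default 10 MB to, say, 50 MB
// when there is enough memory (example value).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

// Tip 7: Tungsten's automatic memory management is enabled by default.
sqlContext.setConf("spark.sql.tungsten.enabled", "true")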
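For tips 4 and 5, here is a minimal sketch of caching a table and processing a large result with foreach() instead of collect(); the students table and the per-row work are hypothetical.

// Tip 5: cache a table that several queries will reuse (stored in in-memory columnar format).
sqlContext.cacheTable("students")

val result = sqlContext.sql("SELECT name FROM students WHERE age > 18")

// Tip 4: do not collect() a large result to the Driver; process rows on the executors instead.
result.foreach { row =>
  val name = row.getString(0)
  // per-row work goes here, e.g. writing to an external store
  println(name)
}

// Release the cache once the table is no longer needed.
sqlContext.uncacheTable("students")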

Three: Problems to be solved

1. Why does the degree of parallelism improve efficiency? Revisit the IO and data-transfer aspects.

2. Clarify the relationship between memory, cache, and physical memory: why does caching optimize performance?

3. What is the role of broadcasting?

4. What is an unresolved logical plan, and what does a logical plan include?

5. What is the difference between optimizing Spark SQL and optimizing an ordinary Spark program?


Origin: blog.csdn.net/weixin_39966065/article/details/93419728