Big data-Spark SQL performance optimization

Spark SQL performance optimization

1. Caching table data in memory

Scala code (the MySQL JDBC driver jar is passed to spark-shell on startup):

spark-shell --master spark://hadoop1:7077 --jars /root/temp/mysql-connector-java-8.0.13.jar --driver-class-path /root/temp/mysql-connector-java-8.0.13.jar

// Load the employee table from MySQL over JDBC
val mysql = spark.read.format("jdbc").option("user","destiny").option("password","destiny").option("url","jdbc:mysql://192.168.138.1:3306/data?serverTimezone=GMT%2B8").option("dbtable","employee").load()
// Register a temporary view and cache it in memory as a columnar table
mysql.createOrReplaceTempView("employee")
spark.sqlContext.cacheTable("employee")
// cacheTable is lazy: the first query materializes the cache
spark.sql("select * from employee").show

Result

(screenshot of the query output omitted)
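To verify the cache and release it later, a minimal follow-up sketch in the same spark-shell session, using the "employee" view created above:

// true once the table has been cached and materialized by a query
spark.catalog.isCached("employee")
// subsequent queries are served from the in-memory columnar cache
spark.sql("select * from employee").show
// release the cached data when it is no longer needed
spark.sqlContext.uncacheTable("employee")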

2. Performance optimization parameters

(1) Parameters for caching data in memory

spark.sql.inMemoryColumnarStorage.compressed

The default is true

Spark SQL automatically selects a compression codec for each column based on statistics of the data

spark.sql.inMemoryColumnarStorage.batchSize

The default is 10000

The batch size used for columnar caching. A larger batch size can improve memory utilization and compression ratio, but it also brings the risk of OOM (out of memory) errors
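Both settings can be changed at runtime before a table is cached. A minimal sketch in the same spark-shell session; the batch size below is only an illustrative value, not a recommendation:

spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "20000")
// the new settings take effect when the table is (re)cached
spark.sqlContext.cacheTable("employee")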

(2) Other performance-related configuration options (manual modification is not recommended)

spark.sql.files.maxPartitionBytes

The default is 128 MB

The maximum number of bytes to pack into a single partition when reading files

spark.sql.files.openCostInBytes

The default is 4 MB

The estimated cost of opening a file, measured by the number of bytes that could be scanned in the same time. It is used when putting multiple files into a single partition. It is better to overestimate this value; partitions with small files will then be faster than partitions with larger files (which are scheduled first)
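Both file-read options can be set through spark.conf before loading data. A hedged sketch; the values are simply the defaults written out in bytes, and the Parquet path is hypothetical:

// 128 MB per read partition
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728L)
// assume 4 MB of scan cost for every extra file opened
spark.conf.set("spark.sql.files.openCostInBytes", 4194304L)
val logs = spark.read.parquet("/path/to/parquet")   // hypothetical path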

spark.sql.autoBroadcastJoinThreshold

The default is 10 MB

Configures the maximum size, in bytes, of a table that will be broadcast to all worker nodes when performing a join. Broadcasting can be disabled by setting this value to -1. Note that data statistics are currently only supported for Hive Metastore tables on which the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run
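A sketch of how this threshold is typically adjusted; the 50 MB value and the dept table (with its deptno and dname columns) are hypothetical, used only to illustrate a join against the employee view from above:

// raise the broadcast threshold to 50 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)
// or disable broadcast joins entirely
// spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
// explain shows whether a broadcast join was chosen
spark.sql("select e.*, d.dname from employee e join dept d on e.deptno = d.deptno").explain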

spark.sql.shuffle.partitions

The default is 200

Configures the number of partitions used when shuffling data for joins or aggregations
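For a small table such as employee, 200 shuffle partitions are far more than needed. A minimal sketch of lowering the value before an aggregation; the deptno column is hypothetical:

spark.conf.set("spark.sql.shuffle.partitions", 50L)
spark.sql("select deptno, count(*) from employee group by deptno").show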

Origin blog.csdn.net/JavaDestiny/article/details/97493636