Spark SQL performance optimization
1. Caching table data in memory
spark-shell --master spark://hadoop1:7077 --jars /root/temp/mysql-connector-java-8.0.13.jar --driver-class-path /root/temp/mysql-connector-java-8.0.13.jar
val mysql = spark.read.format("jdbc").option("user","destiny").option("password","destiny").option("url","jdbc:mysql://192.168.138.1:3306/data?serverTimezone=GMT%2B8").option("dbtable","employee").load
mysql.createOrReplaceTempView("employee")
spark.sqlContext.cacheTable("employee")
spark.sql("select * from employee").show
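Once the cached table is no longer needed, the cache should be released to free executor memory. A minimal sketch, continuing in the same spark-shell session:

```scala
// Release the cached "employee" table from the in-memory columnar cache.
spark.sqlContext.uncacheTable("employee")

// Or drop every cached table at once:
spark.sqlContext.clearCache()
```

Note that cacheTable is lazy: the data is only materialized in memory the first time a query (such as the show above) actually scans the table.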
2. Performance-related optimization parameters
(1) Optimized parameters for data caching in memory
spark.sql.inMemoryColumnarStorage.compressed
The default is true
Spark SQL will automatically select a compression codec for each column based on statistics
spark.sql.inMemoryColumnarStorage.batchSize
The default is 10000
The cache batch size. When caching data, a larger batch size can improve memory utilization and the compression ratio, but it also brings the risk of OOM (out of memory)
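Both caching parameters can be changed on the session before the table is cached. A minimal sketch; the batch-size value here is illustrative, not a recommendation:

```scala
// Configure the in-memory columnar cache before calling cacheTable.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "20000") // illustrative value

// Subsequent caching of the "employee" view uses these settings.
spark.sqlContext.cacheTable("employee")
```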
(2) Other performance-related configuration options (manual modification is not recommended)
spark.sql.files.maxPartitionBytes
128MB by default
The maximum number of bytes to pack into a single partition when reading files
spark.sql.files.openCostInBytes
The default is 4MB
The estimated cost of opening a file, measured in the number of bytes that could be scanned in the same time. It is used when packing multiple files into one partition. It is better to overestimate this value, so that partitions with small files will be faster than partitions with large files
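Since manual modification of these two parameters is not recommended, a safer exercise is simply to inspect their current values on the session. A minimal sketch:

```scala
// Read the current file-partitioning settings from the session config.
val maxPartitionBytes = spark.conf.get("spark.sql.files.maxPartitionBytes")
val openCostInBytes   = spark.conf.get("spark.sql.files.openCostInBytes")

println(s"maxPartitionBytes = $maxPartitionBytes") // 128 MB by default
println(s"openCostInBytes   = $openCostInBytes")   // 4 MB by default
```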
spark.sql.autoBroadcastJoinThreshold
The default is 10M
Used to configure the maximum size in bytes of a table that will be broadcast to all worker nodes when performing a join. Broadcasting can be disabled by setting this value to -1. Note that statistics are currently only supported for Hive Metastore tables on which the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run
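The threshold can be tuned per session, and a broadcast can also be requested explicitly with a hint regardless of the threshold. A minimal sketch; the "department" table and "deptno" join column are hypothetical, not from the example above:

```scala
import org.apache.spark.sql.functions.broadcast

// Disable automatic broadcast joins entirely:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Or force a broadcast of the small side with an explicit hint,
// assuming a hypothetical "department" table and "deptno" key:
val small  = spark.table("employee")
val joined = spark.table("department").join(broadcast(small), "deptno")
```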
spark.sql.shuffle.partitions
The default is 200
Used to configure the number of partitions used when shuffling data in join or aggregation operations
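For small datasets the default of 200 shuffle partitions means most tasks process almost no data, so lowering the value can reduce scheduling overhead. A minimal sketch; the "deptno" grouping column is a hypothetical column of the employee table:

```scala
// Reduce the shuffle partition count for a small aggregation;
// 50 is an illustrative value, not a recommendation.
spark.conf.set("spark.sql.shuffle.partitions", "50")

// The group-by below now shuffles into 50 partitions instead of 200.
spark.sql("select deptno, count(*) from employee group by deptno").show
```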