Spark code readability and performance optimization: example 10 (project structure)
Foreword
Organize code by the function of each package
When each package has a clear responsibility, you can easily navigate the project code, the structure stays clear, and development is less chaotic.
An example is given below.
Explanation
app: stores the Spark applications you develop
common: stores common configuration or engine functions
data.in: stores functions that read from data sources
data.out: stores functions that write data out
data.process: stores data-processing functions
kyro: stores the Kryo registration configuration corresponding to each Spark application
util: utility classes
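For instance, pure processing logic belongs in data.process, kept separate from reading (data.in) and writing (data.out). A minimal sketch of such a function, with illustrative names and logic not taken from the original project:

```scala
// Would live in the data.process package of the layout above.
// Object name, field meanings, and logic are illustrative.
object AgeStat {
  /** Average age per city, computed on plain Scala pairs of (city, age). */
  def avgAgeByCity(rows: Seq[(String, Int)]): Map[String, Double] =
    rows.groupBy { case (city, _) => city }
        .map { case (city, xs) => city -> xs.map(_._2).sum.toDouble / xs.size }
}
```

Because the function has no Spark dependency, it can be unit-tested on its own, which is one benefit of separating processing code from I/O code.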
A template-based design for Spark application classes
Designing a template base class gives better control over the program flow, separates the code structurally, improves readability, and makes later maintenance of the project easier.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
  * Base class for Spark applications
  * <p>
  * Date: 2018/1/19 15:06
  *
  * @author ALion
  */
abstract class BaseSparkApp extends AppTrait {

  protected final val appName = getClass.getSimpleName

  protected final var spark: SparkSession = _

  /**
    * Start the application
    */
  final def startApp(): Unit = {
    val time1 = System.currentTimeMillis()
    println("-------> " + appName + " start ")

    onInit()
    createSession()
    onRun()
    onStop()
    onDestroyed()

    val time2 = System.currentTimeMillis()
    println("-------> " + appName + " end costTime=" + (time2 - time1) / 1000 + "s")
  }

  /**
    * Stop the application manually
    */
  final def stopApp(): Unit = {
    onStop()
  }

  /**
    * Create the SparkSession
    */
  private def createSession(): Unit = {
    spark = SparkSession.builder()
      .config(getConf)
      .enableHiveSupport()
      .getOrCreate()
  }

  /**
    * Configuration for the Spark application
    *
    * @return SparkConf
    */
  protected def getConf: SparkConf = {
    new SparkConf()
      .setAppName(appName)
      .set("spark.network.timeout", "300")
      .set("spark.shuffle.io.retryWait", "30s")
      .set("spark.shuffle.io.maxRetries", "12")
      .set("spark.locality.wait", "9s")
  }

  /**
    * Initialization, called before the application runs
    */
  override protected def onInit(): Unit = {}

  /**
    * Run the application
    */
  override protected def onRun(): Unit

  /**
    * Called when the application finishes
    */
  override protected def onStop(): Unit = {
    if (spark != null) spark.stop()
  }

  /**
    * Called after the application is destroyed
    */
  override protected def onDestroyed(): Unit = {}
}
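The control flow inside startApp is the classic template-method pattern: the base class fixes the lifecycle order and subclasses fill in the steps. Stripped of Spark, it can be exercised in isolation. Below is a minimal sketch; AppTrait here is a guessed stand-in, since its actual definition is not shown in the article:

```scala
// Minimal, Spark-free sketch of the same template-method flow.
// AppTrait is a guessed stand-in for the trait used in the article.
trait AppTrait {
  protected def onInit(): Unit
  protected def onRun(): Unit
  protected def onStop(): Unit
  protected def onDestroyed(): Unit
}

abstract class MiniApp extends AppTrait {
  val calls = scala.collection.mutable.Buffer.empty[String]

  // final: subclasses cannot reorder the lifecycle
  final def startApp(): Unit = {
    onInit()
    onRun()
    onStop()
    onDestroyed()
  }

  override protected def onInit(): Unit = calls += "init"
  override protected def onStop(): Unit = calls += "stop"
  override protected def onDestroyed(): Unit = calls += "destroyed"
}
```

A subclass only implements onRun; calling startApp then always yields the fixed order init, run, stop, destroyed, no matter what the subclass does.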
Explanation
To develop a new Spark application, inherit from BaseSparkApp.
AppTrait describes the basic behavior an application must have.
BaseSparkApp is the actual template class that controls the application's flow.
startApp is called to launch the application. It contains the flow-control code and is declared final so that later developers cannot alter the process. Extra features, such as the timing statistics shown here, can also be added in it.
onInit is called before the application starts, that is, before the SparkSession is created; override it when you need it. Put initialization code here, for example fetching data from sources other than Spark. Keeping such code in one well-defined place prevents it from accidentally ending up in the wrong position, where it would only run after Spark has started, leaving the cluster idle while unrelated code executes and pointlessly wasting cluster resources.
createSession creates the SparkSession. It is called after onInit. If you need a customized SparkConf, simply override the getConf method.
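For example, a subclass that wants Kryo serialization on top of the defaults might override getConf as follows (a sketch; the registered class and extra settings are illustrative, not from the original project):

```scala
import org.apache.spark.SparkConf

// Hypothetical subclass tweaking the configuration: call super.getConf
// to keep the defaults from BaseSparkApp, then add what you need.
abstract class KryoSparkApp extends BaseSparkApp {
  override protected def getConf: SparkConf = {
    super.getConf
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[scala.collection.mutable.ArrayBuffer[_]]))
  }
}
```

Calling super.getConf first keeps the shared defaults in one place, so per-application overrides stay short.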
onRun holds the real business-processing code. It is called after createSession, so the member variable spark is available; write your business logic here.
onStop closes the Spark application automatically. It is called after onRun. Once the business logic has finished, Spark is shut down proactively (Spark sometimes hangs because of various problems, takes a while to shut down, or your application still has other time-consuming code to run after the Spark work). It also prevents you from forgetting to call spark.stop().
onDestroyed handles application destruction. It is called after onStop. Some follow-up work does not need the Spark environment and should not keep occupying cluster resources. For example, if your application must notify other services after it finishes, that should be done after the stop, i.e. in onDestroyed.
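Putting the lifecycle together, a concrete application might look like the following. This is a hedged sketch: the object name, the table being counted, and the notification logic are all illustrative, not from the original project.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical Spark application built on BaseSparkApp.
object UserCountApp extends BaseSparkApp {

  // Before the SparkSession exists: load parameters, fetch dictionaries, etc.
  override protected def onInit(): Unit = {
    println("loading job parameters ...")
  }

  // Business logic; the member variable `spark` is ready here.
  override protected def onRun(): Unit = {
    val count = spark.table("dw.user_info").count()
    println(s"user count = $count")
  }

  // After spark.stop(): notify other services, release non-Spark resources.
  override protected def onDestroyed(): Unit = {
    println("notify downstream service: job finished")
  }

  def main(args: Array[String]): Unit = startApp()
}
```

Submitting this with spark-submit runs the whole fixed lifecycle; the subclass never has to remember the order of the steps or the call to spark.stop().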