Spark code readability and performance optimization - Example 10 (project structure)

Foreword

Organize packages by function

  • Organizing packages by function makes the project's code structure easy to navigate, keeps each package's responsibility clear, and reduces clutter as development proceeds
  • Here is an example layout for reference (a text sketch of the tree appears after the list below):
    (figure: Spark application package structure)
  • Explanation
    • app stores the Spark applications you develop
    • common stores shared configuration and common functionality
    • data.in stores data-source acquisition functions
    • data.out stores data output functions
    • data.process stores data processing functions
    • kyro stores the Kryo registration configuration for each Spark application
    • util stores utility classes
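  • As a sketch, that layout corresponds to a package tree like the following (the root package name com.example.spark is hypothetical; the package names come from the list above):

      com.example.spark
      ├── app             // Spark applications
      ├── common          // shared configuration / common functionality
      ├── data
      │   ├── in          // data-source acquisition
      │   ├── out         // data output
      │   └── process     // data processing
      ├── kyro            // Kryo registration config per application
      └── util            // utilities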

Design a template base class for Spark applications

  • Designing a template base class lets you control the run flow of your code, keeps the code structure well separated, improves readability, and makes later maintenance of the project easier
  • Here is an example for reference, as follows
    • AppTrait
      /**
       * Spark application trait
       * <p>
       * Date: 2018/3/2 9:49
       * @author ALion
       */
      trait AppTrait {

        /**
         * Initialization, called before the application runs
         */
        protected def onInit(): Unit

        /**
         * Called when the application starts running
         */
        protected def onRun(): Unit

        /**
         * Called when the application stops
         */
        protected def onStop(): Unit

        /**
         * Called after the application is destroyed
         */
        protected def onDestroyed(): Unit

      }
      
    • BaseSparkApp
      import org.apache.spark.SparkConf
      import org.apache.spark.sql.SparkSession
      
      /**
       * Spark application base class
       * <p>
       * Date: 2018/1/19 15:06
       *
       * @author ALion
       */
      abstract class BaseSparkApp extends AppTrait {
      
        protected final val appName = getClass.getSimpleName
      
        protected final var spark: SparkSession = _
      
        /**
         * Start the application
         */
        final def startApp(): Unit = {
          val time1 = System.currentTimeMillis()
          println("-------> " + appName + " start ")

          onInit()

          createSession()

          onRun()

          onStop()

          onDestroyed() // per the lifecycle described below, destruction follows stop

          val time2 = System.currentTimeMillis()
          println("-------> " + appName + " end costTime=" + (time2 - time1) / 1000 + "s")
        }
      
        /**
         * Stop the application manually
         */
        final def stopApp(): Unit = {
          onStop()
        }
      
        /**
         * Create the SparkSession
         */
        private def createSession(): Unit = {
          spark = SparkSession.builder()
            .config(getConf)
            .enableHiveSupport()
            .getOrCreate()
        }
      
        /**
         * Spark application configuration
         *
         * @return SparkConf
         */
        protected def getConf: SparkConf = {
          new SparkConf()
            .setAppName(appName)
            .set("spark.network.timeout", "300")
            .set("spark.shuffle.io.retryWait", "30s")
            .set("spark.shuffle.io.maxRetries", "12")
            .set("spark.locality.wait", "9s")
        }
      
        /**
         * Initialization, called before the application runs
         */
        override protected def onInit(): Unit = {}
      
        /**
         * Application run (business logic goes here)
         */
        override protected def onRun(): Unit
      
        /**
         * Called when the application stops
         */
        override protected def onStop(): Unit = {
          if (spark != null) spark.stop()
        }
      
        /**
         * Called after the application is destroyed
         */
        override protected def onDestroyed(): Unit = {}
      
      }
      
  • Explanation
    • To develop a Spark application, inherit from BaseSparkApp (a minimal example subclass follows this list)
    • AppTrait describes the basic lifecycle of an application
    • BaseSparkApp is the actual template class that controls the application's run flow
    • Call the startApp method to start the application
      • It holds the flow-control code and is declared final, preventing subclasses from tampering with the run flow
      • Extra features, such as the timing statistics shown here, can be added in this method
    • onInit is called before the application starts
      • It runs before the SparkSession is created; override it when needed
      • Put initialization code here, for example fetching data from other sources. This keeps such code out of the wrong place: if it ran after Spark started, the cluster would sit idle while that other code executed, pointlessly wasting cluster resources
    • createSession creates the SparkSession
      • Called after onInit
      • If you need a different SparkConf, just override the getConf method
    • onRun holds the real business-processing code
      • Called after createSession
      • The member variable spark is available here; write your business logic in this method
    • onStop shuts the Spark application down automatically
      • Called automatically after onRun
      • Once the business logic finishes, Spark is closed explicitly (Spark sometimes hangs because of some problem, or waits a while before shutting down, or your application still has other time-consuming code after the Spark work)
      • It also prevents you from forgetting to call spark.stop()
    • onDestroyed is called after the application is destroyed
      • Called after onStop
      • Some business logic does not need the Spark environment and should not keep occupying cluster resources
      • For example, after a Spark application finishes running it may need to notify other services; that should be done after the stop, i.e. in onDestroyed
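  • For illustration, here is a minimal sketch of a concrete application built on this template. It is an assumed example: the object name, the SQL query, the Kryo-registered class, and the notification printouts are all hypothetical, not part of the original project
      import org.apache.spark.SparkConf

      // Hypothetical record class, shown only to illustrate Kryo registration
      case class WordStat(word: String, count: Long)

      object WordCountApp extends BaseSparkApp {

        // Optional: customize the SparkConf by overriding getConf,
        // e.g. to switch on Kryo serialization and register classes
        override protected def getConf: SparkConf = {
          super.getConf
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .registerKryoClasses(Array(classOf[WordStat]))
        }

        // Runs before the SparkSession is created: fetch external inputs here
        override protected def onInit(): Unit = {
          println("loading parameters from an external source ...")
        }

        // The business logic; the member variable `spark` is ready to use
        override protected def onRun(): Unit = {
          spark.sql("SELECT word, count(*) AS cnt FROM demo.words GROUP BY word") // hypothetical table
            .show()
        }

        // Runs after Spark has stopped: e.g. notify other services
        override protected def onDestroyed(): Unit = {
          println("notifying downstream services ...")
        }

        def main(args: Array[String]): Unit = {
          // Lifecycle: onInit -> createSession -> onRun -> onStop -> onDestroyed
          startApp()
        }
      }
  • Because startApp is final, every application built this way shares the same lifecycle and timing log; each subclass only decides what happens at each hook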