Spark's SparkContext source code analysis

1. Introduction

  SparkContext is the main entry point of a Spark program and represents the connection to a Spark cluster. All operations on the cluster go through the SparkContext, which can be used to create RDDs, accumulators, and broadcast variables on that cluster. Every Spark program must create a SparkContext object. The StreamingContext used for streaming computation and the SQLContext used for SQL computation are likewise associated with an existing SparkContext, or implicitly create one. The source code is as follows:

/**
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 *
 * Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before
 * creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
 *
 * @param config a Spark Config object describing the application configuration. Any settings in
 *   this config overrides the default configs as well as system properties.
 */
class SparkContext(config: SparkConf) extends Logging {

  // The call site where this SparkContext was constructed.
  private val creationSite: CallSite = Utils.getCallSite()

  // If true, log warnings instead of throwing exceptions when multiple SparkContexts are active
  private val allowMultipleContexts: Boolean =
    config.getBoolean("spark.driver.allowMultipleContexts", false)

  // In order to prevent multiple SparkContexts from being active at the same time, mark this
  // context as having started construction.
  // NOTE: this must be placed at the beginning of the SparkContext constructor.
  SparkContext.markPartiallyConstructed(this, allowMultipleContexts)

  val startTime = System.currentTimeMillis()

  private[spark] val stopped: AtomicBoolean = new AtomicBoolean(false)

  private[spark] def assertNotStopped(): Unit = {
    if (stopped.get()) {
      val activeContext = SparkContext.activeContext.get()
      val activeCreationSite =
        if (activeContext == null) {
          "(No active SparkContext.)"
        } else {
          activeContext.creationSite.longForm
        }
      throw new IllegalStateException(
        s"""Cannot call methods on a stopped SparkContext.
           |This stopped SparkContext was created at:
           |
           |${creationSite.longForm}
           |
           |The currently active SparkContext was created at:
           |
           |$activeCreationSite
         """.stripMargin)
    }
  }

  def this() = this(new SparkConf())

  def this(master: String, appName: String, conf: SparkConf) =
    this(SparkContext.updatedConf(conf, master, appName))

  def this(
      master: String,
      appName: String,
      sparkHome: String = null,
      jars: Seq[String] = Nil,
      environment: Map[String, String] = Map()) = {
    this(SparkContext.updatedConf(new SparkConf(), master, appName, sparkHome, jars, environment))
  }

  private[spark] def this(master: String, appName: String) =
    this(master, appName, null, Nil, Map())

  private[spark] def this(master: String, appName: String, sparkHome: String) =
    this(master, appName, sparkHome, Nil, Map())

  private[spark] def this(master: String, appName: String, sparkHome: String, jars: Seq[String]) =
    this(master, appName, sparkHome, jars, Map())

  // log out Spark Version in Spark driver log
  logInfo(s"Running Spark version $SPARK_VERSION")
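
  As the introduction notes, StreamingContext and SQLContext attach to an existing SparkContext. A minimal, illustrative sketch of this (in Spark 2.x, SQLContext is superseded by SparkSession, which likewise wraps a SparkContext; the names used below are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("entry-points"))
val ssc = new StreamingContext(sc, Seconds(5)) // reuses the existing SparkContext
val sqlContext = new SQLContext(sc)            // wraps the same SparkContext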

2. SparkConf configuration

  When initializing a SparkContext, only a SparkConf configuration object is needed as a parameter. The SparkConf class that holds the configuration is defined in SparkConf.scala in the same directory. Its main member is a hash map whose keys and values are both strings.
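
  A simplified sketch of that member, assuming the layout of the open-source SparkConf (the exact collection type has varied across releases; recent versions use a concurrent map for thread safety):

import java.util.concurrent.ConcurrentHashMap

class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {

  // All configuration is stored here as <key, value> pairs of strings
  private val settings = new ConcurrentHashMap[String, String]()

  // ... setters, getters and default handling omitted
}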

  

  Although SparkConf provides some convenient setter methods, all configuration is ultimately stored in settings as <key, value> pairs. For example, the setMaster method simply sets the configuration item spark.master.
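
  Roughly, setMaster and the other convenience setters are thin wrappers over set(), which writes into settings (simplified sketch; deprecation and null checks omitted):

/** The master URL to connect to, such as "local[4]" or "spark://master:7077". */
def setMaster(master: String): SparkConf = {
  set("spark.master", master)
}

/** Set a configuration variable. */
def set(key: String, value: String): SparkConf = {
  settings.put(key, value)
  this
}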

  

  Therefore, the master address can be set with the spark.master item in a configuration file, with the --master option on the spark-submit command line, or with the setMaster() method, but their priorities differ: values set directly on SparkConf take the highest precedence, followed by spark-submit flags, and finally the entries in spark-defaults.conf.

  Only one active SparkContext is allowed per JVM; otherwise an exception is thrown by default. For example, in the interactive environment started by spark-shell, a SparkContext named sc has already been created, so constructing another one directly will report an error. A simple workaround is to stop the default sc first.
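
  For example, in spark-shell (illustrative; the application name is arbitrary):

sc.stop() // stop the pre-created context
val sc2 = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("replacement"))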

  Of course, you can also suppress this error by setting spark.driver.allowMultipleContexts to true.
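
  A minimal sketch of that workaround (the application name is illustrative; running multiple contexts in one JVM is generally not recommended):

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("second-context")
  .set("spark.driver.allowMultipleContexts", "true")
val sc2 = new SparkContext(conf) // now only a warning is logged instead of an exception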

  

3. Initialization process

  All services are started during the construction of the SparkContext. Because of how Scala classes work, every auxiliary constructor ends up calling the primary constructor, whose body is the class definition itself. Besides initializing the various configurations and logging, the most important initialization step is starting the task scheduler and the DAG scheduler.
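
  A rough sketch of the relevant fragment of the primary constructor, based on the Spark 2.x source (abbreviated; error handling and the surrounding setup are omitted):

// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's constructor
_taskScheduler.start()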

  

  The difference between DAG scheduling and task scheduling is that the DAGScheduler is the high-level scheduler: it builds a directed acyclic graph of stages for each job, tracks which stage outputs are materialized, finds a minimal schedule to run the job, and then submits stages as sets of tasks to the task scheduler for execution. The task scheduler only accepts requests from the DAG scheduler and is responsible for the actual scheduling and execution of tasks, which is why the DAGScheduler must be initialized after the task scheduler.

  The advantage of separating DAG scheduling from task scheduling is that Spark can design its own DAG scheduling flexibly while still integrating with external resource managers such as YARN and Mesos.

  The task scheduler itself is created in the createTaskScheduler function. Depending on the master URL specified when the Spark program is submitted, a different type of scheduler is instantiated. createTaskScheduler returns a pair of cooperating objects, a SchedulerBackend and a TaskScheduler: the backend talks to the cluster manager and acquires executors, while the TaskScheduler assigns tasks to them. Taking YARN cluster mode as an example, the backend and the scheduler are instances of different classes, but they are initialized together from the same configuration. The code is as follows:

// Fallback branch of the master-URL match in SparkContext.createTaskScheduler: any URL not
// handled by the built-in local/standalone cases is delegated to a pluggable
// ExternalClusterManager (the YARN cluster manager, for example, is loaded this way).
case masterUrl =>
        val cm = getClusterManager(masterUrl) match {
          case Some(clusterMgr) => clusterMgr
          case None => throw new SparkException("Could not parse Master URL: '" + master + "'")
        }
        try {
          val scheduler = cm.createTaskScheduler(sc, masterUrl)
          val backend = cm.createSchedulerBackend(sc, masterUrl, scheduler)
          cm.initialize(scheduler, backend)
          (backend, scheduler)
        } catch {
          case se: SparkException => throw se
          case NonFatal(e) =>
            throw new SparkException("External scheduler cannot be instantiated", e)
        }
}

4. Other functional interfaces

  In addition to initializing the environment and connecting to the Spark cluster, SparkContext also provides many functional entry points, as follows (a short usage sketch follows the list):

  1. Creating RDDs. All methods for creating RDDs are defined in SparkContext, such as parallelize, textFile, and newAPIHadoopFile.

  2. RDD persistence. The persistence methods persistRDD and unpersistRDD are also defined in SparkContext.

  3. Creating shared variables, including accumulators and broadcast variables.

  4. stop(). Stops the SparkContext.

  5. runJob. Submits an RDD action for execution; this is the entry point for all scheduled execution.
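
  A short, illustrative sketch that touches each of these entry points (Spark 2.x API; the application name and the numbers used here are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("demo"))

    // 1. Create an RDD
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // 2. Persist it (RDD.cache() delegates to SparkContext.persistRDD internally)
    rdd.cache()

    // 3. Shared variables: a broadcast variable and an accumulator
    val factor = sc.broadcast(10)
    val processed = sc.longAccumulator("processed")

    // 5. Actions such as sum() end up calling SparkContext.runJob
    val total = rdd.map { x => processed.add(1); x * factor.value }.sum()
    println(s"total=$total, processed=${processed.value}")

    // 4. Stop the context
    sc.stop()
  }
}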


Origin: www.cnblogs.com/yszd/p/12696690.html