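SparkContext is the main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators and broadcast variables on that cluster. Only one SparkContext may be active per JVM: you must call stop() on the active SparkContext before creating a new one. This limitation may eventually be removed.
Let's look at what Spark does when we create a SparkContext with val sc = new SparkContext(sparkConf):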
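Passing a SparkConf invokes the primary constructor directly. For reference, SparkContext also has a no-argument auxiliary constructor, which simply delegates to the primary constructor with a fresh SparkConf: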
/**
 * Create a SparkContext that loads settings from system properties (for instance, when
 * launching with ./bin/spark-submit).
 */
def this() = this(new SparkConf())
Now let's see what the primary constructor does during initialization:
class SparkContext(config: SparkConf) extends Logging {
  // The call site where this SparkContext was constructed.
  private val creationSite: CallSite = Utils.getCallSite()
  // If true, log warnings instead of throwing exceptions when multiple SparkContexts are active
  private val allowMultipleContexts: Boolean =
    config.getBoolean("spark.driver.allowMultipleContexts", false)
  // In order to prevent multiple SparkContexts from being active at the same time, mark this
  // context as having started construction.
  // NOTE: this must be placed at the beginning of the SparkContext constructor.
  SparkContext.markPartiallyConstructed(this, allowMultipleContexts)
  val startTime = System.currentTimeMillis()
  private[spark] val stopped: AtomicBoolean = new AtomicBoolean(false)
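1. Creates creationSite: CallSite, which records the call site in user code where this SparkContext was constructed.
2. Creates allowMultipleContexts: Boolean, which controls whether several SparkContexts may be active at the same time. It defaults to false; if set to true, detecting multiple active SparkContexts logs a warning instead of throwing an exception.
3. Calls SparkContext.markPartiallyConstructed(this, allowMultipleContexts) to mark this SparkContext as having started construction, which prevents multiple SparkContexts from being active at once (this call must be placed at the very beginning of the constructor).
4. Creates the constant startTime, the time at which construction began.
5. Creates the constant stopped, which records whether this SparkContext has been stopped.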
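The "one active context per JVM" rule from steps 2 and 3 can be modelled in a few lines. The sketch below is not Spark's implementation: the names ActiveContextGuardSketch, Ctx, register and stop are hypothetical, and it only assumes the behaviour described above (throw by default, warn when multiple contexts are allowed).

```scala
object ActiveContextGuardSketch {
  // Hypothetical stand-in for a context object; only its name is used here.
  final class Ctx(val name: String)

  private val lock = new Object
  private var active: Option[Ctx] = None

  /** Registers `ctx` as the active context.
    * Returns a warning message if another context was already active and
    * `allowMultiple` is true; throws if `allowMultiple` is false. */
  def register(ctx: Ctx, allowMultiple: Boolean): Option[String] = lock.synchronized {
    active match {
      case Some(other) if !allowMultiple =>
        throw new IllegalStateException(
          s"Only one context may be active; already running: ${other.name}")
      case Some(other) =>
        active = Some(ctx)
        Some(s"Multiple active contexts detected: ${other.name} and ${ctx.name}")
      case None =>
        active = Some(ctx)
        None
    }
  }

  /** Clears the active slot, but only if `ctx` is the context that holds it. */
  def stop(ctx: Ctx): Unit = lock.synchronized {
    if (active.exists(_ eq ctx)) active = None
  }
}
```

Registering a second context without stopping the first fails, which is why stop() has to be called before a new context is created.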
// log out Spark Version in Spark driver log
logInfo(s"Running Spark version $SPARK_VERSION")
/* ------------------------------------------------------------------------------------- *
| Private variables. These variables keep the internal state of the context, and are |
| not accessible by the outside world. They're mutable since we want to initialize all |
| of them to some neutral value ahead of time, so that calling "stop()" while the |
| constructor is still running is safe. |
* ------------------------------------------------------------------------------------- */
private var _conf: SparkConf = _
private var _eventLogDir: Option[URI] = None
private var _eventLogCodec: Option[String] = None
private var _listenerBus: LiveListenerBus = _
private var _env: SparkEnv = _
private var _statusTracker: SparkStatusTracker = _
private var _progressBar: Option[ConsoleProgressBar] = None
private var _ui: Option[SparkUI] = None
private var _hadoopConfiguration: Configuration = _
private var _executorMemory: Int = _
private var _schedulerBackend: SchedulerBackend = _
private var _taskScheduler: TaskScheduler = _
private var _heartbeatReceiver: RpcEndpointRef = _
@volatile private var _dagScheduler: DAGScheduler = _
private var _applicationId: String = _
private var _applicationAttemptId: Option[String] = None
private var _eventLogger: Option[EventLoggingListener] = None
private var _executorAllocationManager: Option[ExecutorAllocationManager] = None
private var _cleaner: Option[ContextCleaner] = None
private var _listenerBusStarted: Boolean = false
private var _jars: Seq[String] = _
private var _files: Seq[String] = _
private var _shutdownHookRef: AnyRef = _
private var _statusStore: AppStatusStore = _
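6. Logs the Spark version that is running.
7. Declares a long list of variables used internally by SparkContext:
7.1 _conf: the SparkConf.
7.2 _eventLogDir: the event log directory.
7.3 _eventLogCodec: the codec used for event logs.
7.4 _listenerBus: a LiveListenerBus, which asynchronously delivers SparkListenerEvents to the registered SparkListeners. Until start() is called on the LiveListenerBus, all posted events are only buffered; once the bus has started, they are delivered to the attached listeners. The bus is shut down by calling stop(), after which any further events are dropped. Internally it keeps its event queues in a CopyOnWriteArrayList, a thread-safe variant of ArrayList.
7.5 _env: the Spark runtime environment.
7.6 _statusTracker: a low-level status reporting API for monitoring job and stage progress.
7.7 _progressBar: the console progress bar for stages. If several stages are running, it shows their combined progress.
7.8 _ui: the top-level user interface of the Spark application.
7.9 _hadoopConfiguration: the Hadoop configuration.
7.10 _executorMemory: executor memory in MB; defaults to 1024 MB.
7.11 _schedulerBackend: the backend interface of the scheduling system, which allows different backends to be plugged in underneath TaskSchedulerImpl.
7.12 _taskScheduler: the low-level task scheduler interface. At present its only implementation is org.apache.spark.scheduler.TaskSchedulerImpl (to be covered in detail later).
7.13 _heartbeatReceiver: a reference to a remote RpcEndpoint. RpcEndpointRef is thread-safe.
7.14 _dagScheduler: the DAG scheduler, a high-level scheduler that implements stage-oriented scheduling. It builds a DAG of stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a minimal schedule to run the job. It then submits the stages as TaskSets to the underlying TaskScheduler (very important; to be covered in detail later).
7.15 _applicationId: the unique identifier of the Spark application.
7.16 _applicationAttemptId: the unique identifier of the current attempt of the Spark application.
7.17 _eventLogger: a SparkListener that persists events as logs.
7.18 _executorAllocationManager: the executor allocation manager, an agent that dynamically allocates and removes executors based on the workload.
7.19 _cleaner: an asynchronous cleaner for RDD, shuffle and broadcast state.
7.20 _listenerBusStarted: whether the LiveListenerBus has been started.
7.21 _jars: paths of the user-submitted jars. Multiple paths are separated by commas in the configuration.
7.22 _files: the user-submitted files.
7.23 _shutdownHookRef: a reference to the shutdown hook.
7.24 _statusStore: a wrapper around a KVStore that provides methods for fetching API data.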
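The buffer-until-start, drop-after-stop life cycle described for _listenerBus in 7.4 can be illustrated with a toy bus. This is only a sketch under simplifying assumptions: the names BufferingBusSketch and Bus are hypothetical, and delivery here is synchronous, whereas LiveListenerBus delivers events asynchronously.

```scala
import scala.collection.mutable.ArrayBuffer

object BufferingBusSketch {
  // Toy event bus: buffers events before start(), drops them after stop().
  final class Bus[E] {
    private val listeners = ArrayBuffer.empty[E => Unit]
    private val pending = ArrayBuffer.empty[E]
    private var started = false
    private var stopped = false

    def addListener(listener: E => Unit): Unit = synchronized {
      listeners += listener
    }

    def post(event: E): Unit = synchronized {
      if (stopped) ()                          // events after stop() are dropped
      else if (!started) pending += event      // events before start() are buffered
      else listeners.foreach(l => l(event))    // otherwise deliver immediately
    }

    def start(): Unit = synchronized {
      started = true
      // Flush everything that was posted before the bus started.
      pending.foreach(e => listeners.foreach(l => l(e)))
      pending.clear()
    }

    def stop(): Unit = synchronized { stopped = true }
  }
}
```

This ordering is why listeners can be registered early in the constructor while the bus itself is started only near the end of initialization: nothing posted in between is lost.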
// Used to store a URL for each static file/jar together with the file's local timestamp
private[spark] val addedFiles = new ConcurrentHashMap[String, Long]().asScala
private[spark] val addedJars = new ConcurrentHashMap[String, Long]().asScala
// Keeps track of all persisted RDDs
private[spark] val persistentRdds = {
  val map: ConcurrentMap[Int, RDD[_]] = new MapMaker().weakValues().makeMap[Int, RDD[_]]()
  map.asScala
}
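8. Creates addedFiles: a ConcurrentHashMap that maps the URL of each added file to the file's local timestamp.
9. Creates addedJars: the same kind of map for added jars.
10. Creates persistentRdds: a concurrent map with weak values (built with Guava's MapMaker, not a plain ConcurrentHashMap) that keeps track of all persisted RDDs.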
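A URL-to-timestamp map such as addedFiles lends itself to "register only once" semantics. The sketch below shows that idea with the standard library's TrieMap as a stand-in for the wrapped ConcurrentHashMap; AddedFilesSketch, register and timestampOf are hypothetical names, and the URL in the usage note is made up.

```scala
import scala.collection.concurrent.TrieMap

object AddedFilesSketch {
  // Stand-in for addedFiles: URL of a file -> local timestamp of that file.
  private val addedFiles = TrieMap.empty[String, Long]

  /** Atomically records the URL; returns true only the first time it is seen. */
  def register(url: String, timestamp: Long): Boolean =
    addedFiles.putIfAbsent(url, timestamp).isEmpty

  /** Timestamp stored for the URL, if it has been registered. */
  def timestampOf(url: String): Option[Long] = addedFiles.get(url)
}
```

Calling register("hdfs://nn/a.jar", 1L) twice returns true and then false, and the first timestamp is kept, so concurrent callers cannot register the same file twice.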
// Environment variables to pass to our executors.
private[spark] val executorEnvs = HashMap[String, String]()
// Set SPARK_USER for user who is running SparkContext.
val sparkUser = Utils.getCurrentUserName()
11. Creates executorEnvs: a HashMap that holds the environment variables to pass to the executors.
12. Creates sparkUser: the user running the current SparkContext.
private[spark] var checkpointDir: Option[String] = None
// Thread Local variable that can be used by users to pass information down the stack
protected[spark] val localProperties = new InheritableThreadLocal[Properties] {
  override protected def childValue(parent: Properties): Properties = {
    // Note: make a clone such that changes in the parent properties aren't reflected in
    // the those of the children threads, which has confusing semantics (SPARK-10563).
    SerializationUtils.clone(parent)
  }
  override protected def initialValue(): Properties = new Properties()
}
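13. Declares checkpointDir: the checkpoint directory.
14. Creates localProperties: thread-local properties that users can use to pass information down the stack. Because it is an InheritableThreadLocal, a child thread starts with a clone of its parent's properties.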
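The cloning behaviour of localProperties is easy to reproduce with the JDK alone. The sketch below assumes nothing about Spark: LocalPropsSketch, demo and the key "job.group" are hypothetical, and Properties.clone() is used in place of SerializationUtils.clone.

```scala
import java.util.Properties

object LocalPropsSketch {
  val localProperties = new InheritableThreadLocal[Properties] {
    // Called when a child thread is created: hand it a copy, not the parent's object.
    override protected def childValue(parent: Properties): Properties =
      parent.clone().asInstanceOf[Properties]

    override protected def initialValue(): Properties = new Properties()
  }

  /** Returns (value the child thread saw, value the parent sees afterwards). */
  def demo(): (String, String) = {
    localProperties.get().setProperty("job.group", "parent")
    var seenByChild: String = null
    val child = new Thread(new Runnable {
      def run(): Unit = {
        seenByChild = localProperties.get().getProperty("job.group")
        // This change stays in the child's own copy.
        localProperties.get().setProperty("job.group", "child")
      }
    })
    child.start()
    child.join()
    (seenByChild, localProperties.get().getProperty("job.group"))
  }
}
```

The child inherits the parent's value at creation time, but its later change does not leak back into the parent, which is the semantics the SPARK-10563 comment in the source refers to.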
Now for the most important part (the explanations are in the code comments):
try {
  // Clone the SparkConf
  _conf = config.clone()
  // Check for illegal or deprecated settings: illegal settings throw an exception, and
  // deprecated settings are converted to their currently supported equivalents
  _conf.validateSettings()
  // A master URL must be configured; throw if it is missing
  if (!_conf.contains("spark.master")) {
    throw new SparkException("A master URL must be set in your configuration")
  }
  // An application name must be configured; throw if it is missing
  if (!_conf.contains("spark.app.name")) {
    throw new SparkException("An application name must be set in your configuration")
  }
  // log out spark.app.name in the Spark driver logs
  logInfo(s"Submitted application: $appName")
  // System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster
  // (that is, spark.yarn.app.id is required in yarn cluster mode)
  if (master == "yarn" && deployMode == "cluster" && !_conf.contains("spark.yarn.app.id")) {
    throw new SparkException("Detected yarn cluster mode, but isn't running on a cluster. " +
      "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
  }
  // If spark.logConf is enabled, log the full configuration
  if (_conf.getBoolean("spark.logConf", false)) {
    logInfo("Spark configuration:\n" + _conf.toDebugString)
  }
  // Set Spark driver host and port system properties. This explicitly sets the configuration
  // instead of relying on the default value of the config constant.
  _conf.set(DRIVER_HOST_ADDRESS, _conf.get(DRIVER_HOST_ADDRESS))
  // If the driver port is not configured, set it to 0
  _conf.setIfMissing("spark.driver.port", "0")
  // Set the executor ID to the driver identifier
  _conf.set("spark.executor.id", SparkContext.DRIVER_IDENTIFIER)
  // Load the user-submitted jars
  _jars = Utils.getUserJars(_conf)
  // Load the user-submitted files
  _files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.nonEmpty))
    .toSeq.flatten
  // If event logging is enabled, resolve the event log directory
  _eventLogDir =
    if (isEventLogEnabled) {
      val unresolvedDir = conf.get("spark.eventLog.dir", EventLoggingListener.DEFAULT_LOG_DIR)
        .stripSuffix("/")
      Some(Utils.resolveURI(unresolvedDir))
    } else {
      None
    }
  // Initialize the event log codec (only set when compression and event logging are enabled)
  _eventLogCodec = {
    val compress = _conf.getBoolean("spark.eventLog.compress", false)
    if (compress && isEventLogEnabled) {
      Some(CompressionCodec.getCodecName(_conf)).map(CompressionCodec.getShortName)
    } else {
      None
    }
  }
  // Initialize the listener bus, which delivers posted events to the registered listeners
  _listenerBus = new LiveListenerBus(_conf)
  // Initialize the app status store and listener before SparkEnv is created so that it gets
  // all events.
  _statusStore = AppStatusStore.createLiveStore(conf)
  listenerBus.addToStatusQueue(_statusStore.listener.get)
  // Create the Spark execution environment (cache, map output tracker, etc)
  _env = createSparkEnv(_conf, isLocal, listenerBus)
  SparkEnv.set(_env)
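To make the first part of this block concrete, here is a small model of the two mandatory-key checks and of how the comma-separated "spark.files" value is split. It is a sketch only: a plain Map stands in for SparkConf, and ConfChecksSketch, validate and parseFiles are hypothetical names.

```scala
object ConfChecksSketch {
  /** Mirrors the two mandatory-key checks; Left carries the error message. */
  def validate(conf: Map[String, String]): Either[String, Unit] =
    if (!conf.contains("spark.master"))
      Left("A master URL must be set in your configuration")
    else if (!conf.contains("spark.app.name"))
      Left("An application name must be set in your configuration")
    else
      Right(())

  /** Splits the comma-separated "spark.files" value and drops empty entries. */
  def parseFiles(conf: Map[String, String]): Seq[String] =
    conf.get("spark.files")
      .map(_.split(",").filter(_.nonEmpty).toSeq)
      .getOrElse(Seq.empty)
}
```

With "a.txt,,b.txt" as the value, parseFiles yields the two file names and ignores the empty entry between the commas; a missing key yields an empty sequence rather than an error.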
Let's take a closer look at _env = createSparkEnv(_conf, isLocal, listenerBus):
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  // What actually gets created is the driver environment
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master, conf))
}
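As you can see, the call that actually runs is SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master, conf)), so what gets created is the driver's environment. We will cover the underlying implementation of the driver environment later.
Now let's return to SparkContext.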
  // If running the REPL, register the repl's output dir with the file server.
  _conf.getOption("spark.repl.class.outputDir").foreach { path =>
    val replUri = _env.rpcEnv.fileServer.addDirectory("/classes", new File(path))
    _conf.set("spark.repl.class.uri", replUri)
  }
  // Initialize the status tracker
  _statusTracker = new SparkStatusTracker(this, _statusStore)
  // Initialize the console progress bar
  _progressBar =
    if (_conf.get(UI_SHOW_CONSOLE_PROGRESS) && !log.isInfoEnabled) {
      Some(new ConsoleProgressBar(this))
    } else {
      None
    }
  // Initialize the Spark web UI
  _ui =
    if (conf.getBoolean("spark.ui.enabled", true)) {
      Some(SparkUI.create(Some(this), _statusStore, _conf, _env.securityManager, appName, "",
        startTime))
    } else {
      // For tests, do not enable the UI
      None
    }
  // Bind the UI before starting the task scheduler to communicate
  // the bound port to the cluster manager properly
  _ui.foreach(_.bind())
  // Load the Hadoop configuration
  _hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)
  // Add each JAR given through the constructor (as a dependency for tasks)
  if (jars != null) {
    jars.foreach(addJar)
  }
  // Add the files that every node needs to download
  if (files != null) {
    files.foreach(addFile)
  }
  // Initialize executor memory; defaults to 1024 MB
  _executorMemory = _conf.getOption("spark.executor.memory")
    .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
    .orElse(Option(System.getenv("SPARK_MEM"))
    .map(warnSparkMem))
    .map(Utils.memoryStringToMb)
    .getOrElse(1024)
  // Convert java options to env vars as a work around
  // since we can't set env vars directly in sbt.
  for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
    value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
    executorEnvs(envKey) = value
  }
  Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
    executorEnvs("SPARK_PREPEND_CLASSES") = v
  }
  // The Mesos scheduler backend relies on this environment variable to set executor memory.
  // TODO: Set this only in the Mesos scheduler.
  executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
  executorEnvs ++= _conf.getExecutorEnv
  executorEnvs("SPARK_USER") = sparkUser
  // We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
  // retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
  // Initialize the heartbeat receiver, which receives heartbeats from executors
  _heartbeatReceiver = env.rpcEnv.setupEndpoint(
    HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
  // Create and start the scheduler
  // (scheduler backend, task scheduler and DAG scheduler)
  val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
  _schedulerBackend = sched
  _taskScheduler = ts
  _dagScheduler = new DAGScheduler(this)
  _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
  // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
  // constructor
  _taskScheduler.start()
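The executor-memory lookup in the block above is a fallback chain: the configuration key first, then two environment variables, then a default of 1024 MB. The sketch below models only that chain. ExecutorMemorySketch, resolve and toMb are hypothetical names, plain Maps stand in for SparkConf and the process environment, and toMb is a deliberately simplified parser that accepts only an "m" or "g" suffix; it is not Utils.memoryStringToMb.

```scala
object ExecutorMemorySketch {
  /** Simplified, hypothetical parser: "512m" -> 512, "2g" -> 2048. */
  def toMb(s: String): Int = {
    val lower = s.trim.toLowerCase
    if (lower.endsWith("g")) lower.dropRight(1).toInt * 1024
    else if (lower.endsWith("m")) lower.dropRight(1).toInt
    else throw new IllegalArgumentException(s"Unsupported memory string: $s")
  }

  /** First match wins: conf key, then SPARK_EXECUTOR_MEMORY, then SPARK_MEM, then 1024 MB. */
  def resolve(conf: Map[String, String], env: Map[String, String]): Int =
    conf.get("spark.executor.memory")
      .orElse(env.get("SPARK_EXECUTOR_MEMORY"))
      .orElse(env.get("SPARK_MEM"))
      .map(toMb)
      .getOrElse(1024)
}
```

Because orElse is only evaluated when the previous Option is empty, an explicit spark.executor.memory always takes precedence over the environment variables.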
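This block initializes a lot of state; the important pieces are the UI, _hadoopConfiguration, executorMemory and the TaskScheduler. createTaskScheduler creates a different TaskScheduler depending on the master URL that was supplied, and returns a tuple containing the SchedulerBackend and the TaskScheduler.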
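Dispatching on the master URL is a natural fit for Scala pattern matching with regular expressions. The sketch below only illustrates that idea: MasterUrlSketch, classify and the Kind hierarchy are hypothetical, the patterns are written for this example rather than copied from Spark, and the real method returns scheduler objects rather than a description.

```scala
object MasterUrlSketch {
  // Illustrative patterns; they are assumptions made for this example.
  private val LocalN = """local\[([0-9]+|\*)\]""".r
  private val Standalone = """spark://(.*)""".r

  sealed trait Kind
  case object LocalSingleThread extends Kind
  final case class LocalThreads(n: Int) extends Kind
  final case class StandaloneCluster(hostPorts: String) extends Kind
  final case class External(master: String) extends Kind

  /** Classifies a master URL into the kind of scheduler it would need. */
  def classify(master: String): Kind = master match {
    case "local" => LocalSingleThread
    case LocalN(threads) =>
      // "*" means: use as many threads as there are available processors.
      LocalThreads(
        if (threads == "*") Runtime.getRuntime.availableProcessors() else threads.toInt)
    case Standalone(hostPorts) => StandaloneCluster(hostPorts)
    case other => External(other) // e.g. "yarn", handled by an external cluster manager
  }
}
```

A regex extractor in a match only succeeds when the whole string matches, so "local[4]" is classified as LocalThreads(4) while an unrelated string falls through to the last case.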
  // Set the application ID
  _applicationId = _taskScheduler.applicationId()
  // Set the application attempt ID
  _applicationAttemptId = taskScheduler.applicationAttemptId()
  // Store the application ID in the SparkConf
  _conf.set("spark.app.id", _applicationId)
  // Configure the UI reverse proxy, if enabled
  if (_conf.getBoolean("spark.ui.reverseProxy", false)) {
    System.setProperty("spark.ui.proxyBase", "/proxy/" + _applicationId)
  }
  // Pass the application ID to the UI
  _ui.foreach(_.setAppId(_applicationId))
  // Initialize the block manager
  _env.blockManager.initialize(_applicationId)
  // The metrics system for Driver need to be set spark.app.id to app ID.
  // So it should start after we get app ID from the task scheduler and set spark.app.id.
  _env.metricsSystem.start()
  // Attach the driver metrics servlet handler to the web ui after the metrics system is started.
  _env.metricsSystem.getServletHandlers.foreach(handler => ui.foreach(_.attachHandler(handler)))
  // Create the event logging listener and add it to the listener bus
  _eventLogger =
    if (isEventLogEnabled) {
      val logger =
        new EventLoggingListener(_applicationId, _applicationAttemptId, _eventLogDir.get,
          _conf, _hadoopConfiguration)
      logger.start()
      listenerBus.addToEventLogQueue(logger)
      Some(logger)
    } else {
      None
    }
  // Optionally scale number of executors dynamically based on workload. Exposed for testing.
  // Check whether dynamic allocation is enabled
  val dynamicAllocationEnabled = Utils.isDynamicAllocationEnabled(_conf)
  // Initialize the executor allocation manager
  _executorAllocationManager =
    if (dynamicAllocationEnabled) {
      schedulerBackend match {
        case b: ExecutorAllocationClient =>
          Some(new ExecutorAllocationManager(
            schedulerBackend.asInstanceOf[ExecutorAllocationClient], listenerBus, _conf,
            _env.blockManager.master))
        case _ =>
          None
      }
    } else {
      None
    }
  // Start the executor allocation manager
  _executorAllocationManager.foreach(_.start())
  // Initialize and start the context cleaner
  _cleaner =
    if (_conf.getBoolean("spark.cleaner.referenceTracking", true)) {
      Some(new ContextCleaner(this))
    } else {
      None
    }
  _cleaner.foreach(_.start())
  // Set up and start the listener bus (mainly registering any extra listeners)
  setupAndStartListenerBus()
  // Post the environment update event
  postEnvironmentUpdate()
  // Post the application start event
  postApplicationStart()
  // Post init
  _taskScheduler.postStartHook()
  _env.metricsSystem.registerSource(_dagScheduler.metricsSource)
  _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
  _executorAllocationManager.foreach { e =>
    _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
  }
  // Make sure the context is stopped if the user forgets about it. This avoids leaving
  // unfinished event logs around after the JVM exits cleanly. It doesn't help if the JVM
  // is killed, though.
  logDebug("Adding shutdown hook") // force eager creation of logger
  _shutdownHookRef = ShutdownHookManager.addShutdownHook(
    ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
    logInfo("Invoking stop() from shutdown hook")
    try {
      stop()
    } catch {
      case e: Throwable =>
        logWarning("Ignoring Exception while stopping SparkContext from shutdown hook", e)
    }
  }
} catch {
  case NonFatal(e) =>
    logError("Error initializing SparkContext.", e)
    try {
      stop()
    } catch {
      case NonFatal(inner) =>
        logError("Error stopping SparkContext after init error.", inner)
    } finally {
      throw e
    }
}
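The outer try/catch follows a common pattern: if initialization fails part-way, stop whatever was already started, then rethrow the original error so the caller sees the real cause. The sketch below shows the pattern in isolation; InitCleanupSketch, Resource and initialize are hypothetical names, and unlike the original it swallows a non-fatal failure from stop() silently instead of logging it.

```scala
import scala.util.control.NonFatal

object InitCleanupSketch {
  // Hypothetical resource that records whether stop() was called.
  final class Resource {
    var stopped = false
    def stop(): Unit = { stopped = true }
  }

  /** Runs `body`; on a non-fatal failure, stops `res` and rethrows the original error. */
  def initialize(res: Resource)(body: => Unit): Unit =
    try body
    catch {
      case NonFatal(e) =>
        try res.stop()
        catch { case NonFatal(_) => () } // a failure while stopping must not hide `e`
        throw e
    }
}
```

This is also why the private variables are given neutral initial values up front: stop() may run while the constructor is only half finished.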
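In summary, SparkContext initialization creates a large number of objects that the system needs at run time: some validate the configuration, some listen for events, some schedule tasks, and so on.
In later posts I will continue with the heavyweight classes created during SparkContext initialization: SparkEnv, TaskScheduler and DAGScheduler.
Note: please credit the source when reposting.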