How Spark loads the Hadoop configuration

0x0 Background

Recently, I needed to move the five Hadoop/Hive configuration files, namely:

core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml
hive-site.xml

from inside the project (on the classpath) to somewhere outside the project (an arbitrary location). To do that I dug into the source code of the Spark startup process, and this post records what I found.

0x1 How Hadoop and Hive load their default configuration

Hadoop has a class:

public class Configuration implements Iterable<Map.Entry<String,String>>, Writable

This class handles Hadoop's configuration. It contains a static initializer block:

static{
    //print deprecation warning if hadoop-site.xml is found in classpath
    ClassLoader cL = Thread.currentThread().getContextClassLoader();
    if (cL == null) {
      cL = Configuration.class.getClassLoader();
    }
    if(cL.getResource("hadoop-site.xml")!=null) {
      LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
          "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
          + "mapred-site.xml and hdfs-site.xml to override properties of " +
          "core-default.xml, mapred-default.xml and hdfs-default.xml " +
          "respectively");
    }
    addDefaultResource("core-default.xml");
    addDefaultResource("core-site.xml");
  }

So when the Configuration class is loaded, it reads the following three configuration files from the classpath:

hadoop-site.xml
core-default.xml
core-site.xml

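A minimal sketch of this behavior (assuming core-site.xml is on the classpath; fs.defaultFS is just an illustrative key):

import org.apache.hadoop.conf.Configuration;

public class ClasspathConfDemo {
    public static void main(String[] args) {
        // creating a Configuration triggers the static block above, so
        // core-default.xml / core-site.xml from the classpath are already registered
        Configuration conf = new Configuration();
        // prints null if core-site.xml is absent or does not define the key
        System.out.println(conf.get("fs.defaultFS"));
    }
}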
The Configuration class also has four subclasses:

HdfsConfiguration
HiveConf
JobConf
YarnConfiguration

Each of these four classes contains a similar static block. In HdfsConfiguration:

static {
    addDeprecatedKeys();
    // adds the default resources
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
}

In YarnConfiguration:

static {
        addDeprecatedKeys();
        Configuration.addDefaultResource("yarn-default.xml");
        Configuration.addDefaultResource("yarn-site.xml");
        ...
}

In JobConf (its static block delegates to ConfigUtil.loadResources()):

public static void loadResources() {
        addDeprecatedKeys();
        Configuration.addDefaultResource("mapred-default.xml");
        Configuration.addDefaultResource("mapred-site.xml");
        Configuration.addDefaultResource("yarn-default.xml");
        Configuration.addDefaultResource("yarn-site.xml");
}

HiveConf, however, does not load its configuration file in a static block. Instead, during the CarbonData startup process, hive-site.xml is read explicitly:

val hadoopConf = new Configuration()
val configFile = Utils.getContextOrSparkClassLoader.getResource("hive-site.xml")
if (configFile != null) {
    hadoopConf.addResource(configFile)
}

So during Hadoop startup, each component first reads its own configuration files from the classpath.
We can also add configuration through Configuration's set(String name, String value) or addResource(Path file) methods. Internally, addResource runs roughly as follows:

    // add the resource to the resources list (the list holding configuration file resources)
    resources.add(resource);   // add to resources
    // clear the already-loaded properties
    properties = null;         // trigger reload
    finalParameters.clear();   // clear site-limits
    // then reload all configuration via
    loadResources(Properties properties,
                  ArrayList<Resource> resources,
                  boolean quiet)
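For illustration, here is a minimal sketch (the file paths are assumptions) of registering external configuration files on a Configuration at runtime instead of relying on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class AddResourceDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // each addResource call appends to the resources list and triggers the reload shown above
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));  // assumed external path
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));  // assumed external path
        conf.set("dfs.replication", "2");                              // individual keys can be set too
        System.out.println(conf.get("fs.defaultFS"));
    }
}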

0x2 Setting the Hadoop configuration during Spark startup

When a Spark application starts, a SparkContext must be created first. A SparkContext can essentially be understood as the full set of configuration Spark needs to run.

val sc = SparkContext.getOrCreate(sparkConf)

During SparkContext creation, a scheduler backend is created to connect to the remote cluster:

val backend = cm.createSchedulerBackend(sc, masterUrl, scheduler)

For Spark on YARN, YarnClusterManager's createSchedulerBackend method is called:

override def createSchedulerBackend(sc: SparkContext,
      masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend = {
    sc.deployMode match {
      case "cluster" =>
        new YarnClusterSchedulerBackend(scheduler.asInstanceOf[TaskSchedulerImpl], sc)
      case "client" =>
        new YarnClientSchedulerBackend(scheduler.asInstanceOf[TaskSchedulerImpl], sc)
      case  _ =>
        throw new SparkException(s"Unknown deploy mode '${sc.deployMode}' for Yarn")
    }
  }

YarnClientSchedulerBackend then creates a Client (which in turn creates the YarnClient); in its constructor you can see:

private[spark] class Client(
    val args: ClientArguments,
    val hadoopConf: Configuration,
    val sparkConf: SparkConf)
  extends Logging {

  import Client._
  import YarnSparkHadoopUtil._

  def this(clientArgs: ClientArguments, spConf: SparkConf) =
    this(clientArgs, SparkHadoopUtil.get.newConfiguration(spConf), spConf)

  private val yarnClient = YarnClient.createYarnClient
  private val yarnConf = new YarnConfiguration(hadoopConf)

So Spark takes the settings in SparkConf and calls SparkHadoopUtil.get.newConfiguration(spConf) to generate the corresponding Hadoop configuration.
In fact, SparkContext itself holds two member variables for this (essentially one):

private var _hadoopConfiguration: Configuration = _
def hadoopConfiguration: Configuration = _hadoopConfiguration
....
_hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)

This _hadoopConfiguration is likewise obtained through SparkHadoopUtil.get.newConfiguration(_conf).
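As a small sketch (app name, master and key are placeholders), this is the configuration you can read back from the running context, via sc.hadoopConfiguration in Scala or hadoopConfiguration() in the Java API:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class HadoopConfAccessDemo {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("conf-demo").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        // this is the _hadoopConfiguration built by SparkHadoopUtil.get.newConfiguration(_conf)
        Configuration hadoopConf = jsc.hadoopConfiguration();
        System.out.println(hadoopConf.get("fs.defaultFS"));  // illustrative key
        jsc.stop();
    }
}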
Stepping into SparkHadoopUtil.get.newConfiguration(_conf), you can see:

     conf.getAll.foreach { case (key, value) =>
        if (key.startsWith("spark.hadoop.")) {
          hadoopConf.set(key.substring("spark.hadoop.".length), value)
        }
      }

In other words, every SparkConf property whose key starts with spark.hadoop. is converted into a Hadoop configuration entry, with the prefix stripped.
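So Hadoop properties can be injected purely through SparkConf; a sketch with placeholder values:

import org.apache.spark.SparkConf;

public class SparkHadoopPrefixDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("prefix-demo")
                // the "spark.hadoop." prefix is stripped; the rest lands in the Hadoop Configuration
                .set("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020")  // becomes fs.defaultFS
                .set("spark.hadoop.dfs.replication", "2");                 // becomes dfs.replication
    }
}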

This means we can parse Hadoop's XML configuration files ourselves, convert them into the corresponding key-value pairs, and hand them to Spark. The code is as follows:

    // Uses dom4j (org.dom4j.io.SAXReader, Document, Element, DocumentException) for XML parsing.

    /**
     * Read all Hadoop-related configuration files under hadoopConfPath and convert them into a SparkConf.
     *
     * @param hadoopConfPath the directory containing the Hadoop configuration files
     * @return a SparkConf carrying all parsed properties
     */
    public SparkConf getHadoopConf(String hadoopConfPath) {
        SparkConf hadoopConf = new SparkConf();

        try {
            Map<String, String> hiveConfMap = parseXMLToMap(hadoopConfPath + "/hive-site.xml");
            Map<String, String> hadoopConfMap = parseXMLToMap(hadoopConfPath + "/core-site.xml");
            hadoopConfMap.putAll(parseXMLToMap(hadoopConfPath + "/hdfs-site.xml"));
            hadoopConfMap.putAll(parseXMLToMap(hadoopConfPath + "/yarn-site.xml"));
            hadoopConfMap.putAll(parseXMLToMap(hadoopConfPath + "/mapred-site.xml"));

            // hive-site.xml properties are set as-is
            for (Map.Entry<String, String> entry : hiveConfMap.entrySet()) {
                hadoopConf.set(entry.getKey(), entry.getValue());
            }
            // the remaining Hadoop properties are prefixed with "spark.hadoop." so that
            // newConfiguration() later copies them into the Hadoop Configuration
            for (Map.Entry<String, String> entry : hadoopConfMap.entrySet()) {
                hadoopConf.set("spark.hadoop." + entry.getKey(), entry.getValue());
            }
            return hadoopConf;
        } catch (DocumentException e) {
            logger.error("Failed to read the XML configuration file!");
            throw new RuntimeException(e);
        }
    }

    // Parse a Hadoop-style XML configuration file into a HashMap of name/value pairs.
    private Map<String, String> parseXMLToMap(String xmlFilePath) throws DocumentException {
        Map<String, String> confMap = new HashMap<>();
        SAXReader reader = new SAXReader();
        Document document = reader.read(new File(xmlFilePath));
        Element configuration = document.getRootElement();
        Iterator iterator = configuration.elementIterator();
        while (iterator.hasNext()) {
            Element property = (Element) iterator.next();
            String name = property.element("name").getText();
            String value = property.element("value").getText();
            confMap.put(name, value);
        }
        return confMap;
    }
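A possible way to wire this helper into application startup. This is only a sketch: the class name HadoopConfUtil, the directory /opt/conf/hadoop and the app name are all made up for illustration, assuming the two methods above live in such a class:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ExternalConfLauncher {
    public static void main(String[] args) {
        // "/opt/conf/hadoop" is a hypothetical directory outside the project holding the five xml files
        SparkConf conf = new HadoopConfUtil().getHadoopConf("/opt/conf/hadoop")
                .setMaster("yarn")
                .setAppName("external-conf-demo");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        // the Hadoop/Hive settings now come from the external directory instead of the classpath
        jsc.stop();
    }
}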

Note:
After testing, this approach does not work if the cluster is secured with Kerberos.
The likely reason:

class SparkHadoopUtil extends Logging {
      private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
      val conf: Configuration = newConfiguration(sparkConf)
      UserGroupInformation.setConfiguration(conf)

This class builds its own SparkConf, which only loads spark.* properties from System.getProperty, so it never sees the properties we set programmatically, and the Kerberos login fails with an exception.
