How to print logs on the Spark Executor side

1. Problem background

The big data platform submits Spark tasks in yarn-client mode, and multiple offline Spark jobs share a single Driver. This saves the time of submitting each task separately, but it also makes operation and maintenance harder, because the logs of every task are printed to the same file.

To separate the logs of each business process, the platform introduces the log4j2 RoutingAppender, configured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="info">
   <Appenders>
       <Console name="std" target="SYSTEM_OUT">
        <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss,SSS}:%4p %t [%C:%L] - %m%n" />
       </Console>
       <Routing name="myAppender">
            <!-- ctx:logfile means the logfile value is read from the ThreadContext; the corresponding form sys:logfile reads it from a system property. Clearly the former is what allows each task's log to be distinguished. -->
           <Routes pattern="${ctx:logfile}">
               <Route>
                   <File name="log-${ctx:logfile}" fileName="${ctx:logfile}">
                       <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss,SSS}:%4p %t [%C:%L] - %m%n" />
                   </File>
               </Route>
           </Routes>
           <IdlePurgePolicy timeToLive="5" timeUnit="minutes"/>
       </Routing>
   </Appenders>

   <Loggers>
       <Logger name="my" level="INFO" additivity="false">
           <AppenderRef ref="myAppender"/>
       </Logger>
       <Root level="INFO">
           <AppenderRef ref="std"/>
       </Root>
   </Loggers>
</Configuration>
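
To make the routing concrete, here is a minimal driver-side sketch (the task log path is hypothetical; the surrounding platform code decides the real value): a task thread puts the logfile key into the log4j2 ThreadContext and then logs through the logger named my, so the RoutingAppender writes its output to that task's own file.

import org.apache.logging.log4j.{LogManager, ThreadContext}

object DriverTaskLogging {
  def runTask(taskLogPath: String): Unit = {
    // ${ctx:logfile} in log4j2.xml is resolved from the ThreadContext,
    // so each task thread can point its output at a different file.
    ThreadContext.put("logfile", taskLogPath)
    try {
      // "my" matches the <Logger name="my"> declared above.
      val logger = LogManager.getLogger("my")
      logger.info("business log line for this task")
    } finally {
      ThreadContext.remove("logfile")
    }
  }
}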

Recently, while using the platform's secondary-development operators, the data development department reported that the Logger object provided by the platform could not print any logs. Out of curiosity, I dug into the logging framework used by the platform. It is actually expected that this Logger cannot print logs on the executor side: the log4j2.xml above is not distributed to the executors, so no Logger named my is defined there.

Note:

This article is based on the Huawei FusionInsight platform. The FusionInsight Spark client ships a default log4j-executor.properties for the executor side, with the following content:

log4j.logger.org.sparkproject.jetty = WARN
log4j.appender.sparklog = org.apache.log4j.RollingFileAppender
log4j.rootCategory = INFO,sparklog
spark.executor.log.level = INFO
log4j.appender.sparklog.layout.ConversionPattern = %d{yyyy-MM-dd HH:mm:ss,SSS} | %-5p | [%t] | %m | %l%n
log4j.appender.sparklog.Append = true
log4j.appender.sparklog.layout = org.apache.log4j.PatternLayout
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper = INFO
log4j.appender.sparklog.MaxBackupIndex = 10
log4j.appender.sparklog.File = ${spark.yarn.app.container.log.dir}/stdout
log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle = ERROR
log4j.logger.com.huawei.hadoop.dynalogger.DynaLog4jWatcher = OFF
log4j.appender.sparklog.MaxFileSize = 50MB
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter = INFO

So, how can the logs on the executor side be printed and collected?

2. Solutions

2.1 Using scala.Console

The println provided by scala.Console has the same effect as System.out.println in Java. Sample code:

scala.Console.println(s"info zipName is : ${zipName}")
scala.Console.err.println(s"error zipName is : ${zipName}")

This is the way to output executor-side logs in the current big data platform.
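
Keep in mind that the call only ends up in a container's stdout when it executes in executor-side code; a minimal sketch (assuming df is an existing DataFrame and zipName is the value to report) looks like this:

import org.apache.spark.sql.{DataFrame, Row}

object ConsolePrintExample {
  // foreachPartition runs on the executors, so the println output lands in
  // each container's stdout/stderr rather than in the driver log.
  def printOnExecutors(df: DataFrame, zipName: String): Unit = {
    df.foreachPartition { (partition: Iterator[Row]) =>
      scala.Console.println(s"info zipName is : ${zipName}")
      scala.Console.err.println(s"error zipName is : ${zipName}")
    }
  }
}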

Advantages

  • The logs are printed to stdout, and the yarn logs command can be used to aggregate the logs.

  • Simple, no additional class inheritance, no need to modify log4j configuration.

Disadvantages

  • There is no log level. If DEBUG-style messages are stripped out, it becomes hard to locate problems when something goes wrong in production; if they are kept, the volume of logs on yarn becomes excessive, which is also painful to sift through.

2.2 Move the log4j2.xml logger configuration used by the driver to log4j-executor.properties

From the problem background we already know that the driver's logger cannot print logs on the executor side because the executor's log4j configuration has no corresponding logger, so we try moving the driver-side logger configuration into log4j-executor.properties, as follows:

log4j.rootCategory = INFO,sparklog,my
log4j.appender.my = org.apache.log4j.RollingFileAppender
log4j.appender.my.layout.ConversionPattern = %d{yyyy-MM-dd HH:mm:ss,SSS} | %-5p | [%t] | %m | %l%n
log4j.appender.my.Append = true
log4j.appender.my.layout = org.apache.log4j.PatternLayout
log4j.appender.my.MaxBackupIndex = 10
log4j.appender.my.File = ${spark.yarn.app.container.log.dir}/stdout
log4j.appender.my.MaxFileSize = 50MB
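
Under this configuration, a minimal executor-side sketch (assuming the operator code obtains its logger through slf4j, which on the executor is backed by log4j) just logs through a named logger; the message propagates to the root category and is written by the sparklog and my appenders:

import org.slf4j.LoggerFactory

object ExecutorSideLogging {
  // The name "my" is only an example here; any logger that propagates to the
  // root category is handled by the appenders configured above.
  private val logger = LoggerFactory.getLogger("my")

  def logSomething(): Unit = {
    logger.info("executor-side log line")
  }
}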

Advantages

  • There are log levels, which can be modified when locating problems.

Disadvantages

  • Although the my appender is written in imitation of sparklog, actual testing showed that ${spark.yarn.app.container.log.dir} is not substituted for it (the "macro" replacement does not happen), so the logs printed by appender my end up directly in the process working directory, i.e. the startup path of the Spark CoarseGrainedExecutorBackend process. The consequence is that the yarn logs command cannot aggregate these logs, which makes later troubleshooting inconvenient. (When the container is killed or exits, the log file is deleted as well.)

2.3 Inherit org.apache.spark.internal.Logging

This is the way Spark logs are printed internally, and the usage is as follows:

import org.apache.spark.internal.Logging
import org.apache.spark.sql.{DataFrame, Row}

// Mixing in Spark's Logging trait makes logInfo/logError available, including
// inside closures that run on the executors. Bean and the DataFrame come from
// the surrounding platform code.
class Operator(@transient bean: Bean) extends Serializable with Logging {

  def process(dataFrame: DataFrame): Unit = {
    dataFrame.foreachPartition { (partition: Iterator[Row]) =>
      partition.foreach { _ =>
        logInfo(s"info org.apache.spark.internal.Logging print content")
        logError(s"error org.apache.spark.internal.Logging print content")
      }
    }
  }
}

Advantages

  • You can modify the log level when locating the problem

  • Logs can be aggregated through the yarn logs command

Disadvantages

  • Requires additionally extending the org.apache.spark.internal.Logging trait

Extension:

For specific usage of Spark's Logging trait, see the article "Spark - Logging simple use".

3. Summary

1. It is recommended to output executor-side logs by extending the org.apache.spark.internal.Logging trait.

2. The RoutingAppender provided by log4j2 can output each business log to a different file, which is convenient for problem location.


Origin: blog.csdn.net/u011487470/article/details/127402911