Flink on YARN (Part 2): Frequently asked questions and troubleshooting ideas

Author: Yang Tao (risk far)

Flink supports Standalone deployment as well as cluster deployment on YARN, Kubernetes, Mesos, and others. Of these, the YARN cluster deployment mode is increasingly widely used in China. The Flink community is publishing a series of articles on Flink on YARN, split into two parts. Part 1 introduced the whole startup process of a Flink on YARN application, based on the resource scheduling model refactored in FLIP-6. This article, Part 2, answers frequently asked questions about the client and the Flink Cluster, based on feedback from the community, and shares ideas for troubleshooting them.

Client FAQs and troubleshooting ideas

▼ Exception shown on the console when submitting an application: Could not build the program from JAR file.

This message is quite misleading. The cause is often not a problem with the specified JAR file itself, but an exception raised during submission, so further investigation based on the log is needed. The most common cause is that the Hadoop JAR files the client depends on have not been added to the CLASSPATH, so a dependent class cannot be found (for example: ClassNotFoundException: org.apache.hadoop.yarn.exceptions.YarnException) and loading the client entry class (FlinkYarnSessionCli) fails.

▼ How do I connect to a specific YARN cluster when submitting a Flink on YARN application?

A Flink on YARN client typically needs two environment variables, HADOOP_CONF_DIR and HADOOP_CLASSPATH, so that it can load the Hadoop configuration and the dependency JAR files. For example (assuming the environment variable HADOOP_HOME points to the directory of an existing Hadoop deployment):

export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath`
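
Before submitting, it can help to confirm that both variables are actually usable. The function below is a minimal sketch, not part of Flink: the name check_hadoop_env is made up for illustration, and the check for yarn-site.xml assumes a typical Hadoop configuration directory layout.

```shell
# Pre-flight check before running bin/flink (illustrative helper, not a Flink tool).
# Prints what is missing and returns non-zero if the environment looks incomplete.
check_hadoop_env() {
  rc=0
  if [ -z "${HADOOP_CONF_DIR:-}" ]; then
    echo "HADOOP_CONF_DIR is not set"
    rc=1
  elif [ ! -f "${HADOOP_CONF_DIR}/yarn-site.xml" ]; then
    # Assumption: the YARN client configuration lives in yarn-site.xml.
    echo "no yarn-site.xml in ${HADOOP_CONF_DIR}"
    rc=1
  fi
  if [ -z "${HADOOP_CLASSPATH:-}" ]; then
    echo "HADOOP_CLASSPATH is empty"
    rc=1
  fi
  return "$rc"
}
```

Calling check_hadoop_env right after the two export lines gives a quick hint when the client would otherwise fail with a ClassNotFoundException or connect to the wrong cluster.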

▼ Where is the client log, and how is it configured?

The client log is usually in the log folder under the Flink deployment directory: ${FLINK_HOME}/log/flink-${USER}-client-.log, and uses the log4j configuration file ${FLINK_HOME}/conf/log4j-cli.properties.

Some client environments are more complex, which makes it hard to find the log location and configuration. In that case, set the following environment variable to turn on log4j's own DEBUG output and trace log4j's initialization and detailed loading process: export JVM_ARGS="-Dlog4j.debug=true"
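
To find the most recent client log quickly, something like the following sketch can be used. The function name latest_client_log is hypothetical, and the wildcard simply matches whatever follows "client-" in the file name.

```shell
# Print the most recently modified client log under a Flink home directory.
# Usage: latest_client_log <flink_home> [user]   (illustrative helper)
latest_client_log() {
  ls -t "$1"/log/flink-"${2:-${USER:-}}"-client-*.log 2>/dev/null | head -n 1
}
```

For example, tail -n 100 "$(latest_client_log "${FLINK_HOME}")" shows the end of the latest client log for the current user.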

▼ Client troubleshooting ideas

When the client log is not enough to locate the problem, change the log level in the log4j configuration file from INFO to DEBUG and run again to see whether the DEBUG log helps. For problems with no log or incomplete information, code-level debugging may be necessary. Modifying the source code and repackaging is too cumbersome, so the Java bytecode injection tool Byteman is recommended (for a detailed description of the syntax, see the Byteman Document). Example of use:

(1) Write a debug script. For example, to print the client class Flink actually uses, the following script prints the return value when the method CliFrontend#getActiveCustomCommandLine exits:

RULE test
CLASS org.apache.flink.client.cli.CliFrontend
METHOD getActiveCustomCommandLine
AT EXIT
IF TRUE
DO traceln("------->CliFrontend#getActiveCustomCommandLine return: "+$!);
ENDRULE

(2) Set the environment variables so that Byteman is loaded as a javaagent:

export BYTEMAN_HOME=/path/to/byte-home
export TRACE_SCRIPT=/path/to/script
export JVM_ARGS="-javaagent:${BYTEMAN_HOME}/lib/byteman.jar=script:${TRACE_SCRIPT}"

(3) Run the test command bin/flink run -m yarn-cluster -p 1 ./examples/streaming/WordCount.jar. The console output includes:

------->CliFrontend#getActiveCustomCommandLine return: org.apache.flink.yarn.cli.FlinkYarnSessionCli@25ce9dc4

Flink Cluster FAQs and troubleshooting ideas

▼ Version conflicts between the JAR packages of the user application and the framework

This problem usually throws NoSuchMethodError, ClassNotFoundException, IncompatibleClassChangeError, or similar exceptions. To solve it:

1. First, locate the dependency library from the class named in the exception. Then run mvn dependency:tree in the project to display all dependency chains as a tree and find the conflicting dependency. You can also add the -Dincludes parameter to select the packages to display, in the format [groupId]:[artifactId]:[type]:[version]; wildcard matching with * is supported and multiple patterns are separated by commas, for example: mvn dependency:tree -Dincludes=power,javaassist;

2. Once the conflicting package is located, decide how to exclude it. The simple solution is to use exclusion to remove the transitive dependency brought in by other dependencies of the project. Some scenarios, however, require multiple versions to coexist because different components depend on different versions; in that case consider using the Maven Shade plugin (see Maven Shade Plugin).

▼ When multiple versions of a dependency JAR coexist, how do I determine which JAR a class was actually loaded from?

Many applications run with multiple versions of the same dependency JAR on the CLASSPATH, so the version actually used depends on the load order, and troubleshooting often requires determining which JAR a class came from. Flink supports configuring JVM parameters for the JM/TM processes, so the loaded classes and their sources can be printed (to the .out log) with any of the following three configuration items, chosen as needed:

env.java.opts=-verbose:class   // applies to JobManager & TaskManager
env.java.opts.jobmanager=-verbose:class   // applies to JobManager
env.java.opts.taskmanager=-verbose:class   // applies to TaskManager
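
Once -verbose:class is enabled, the .out log can be searched for the class in question. The sketch below assumes the Java 8 output format, "[Loaded <class> from <source>]"; newer JVMs print class loading in a different format, so the pattern would need adjusting. The function name class_source is made up for illustration.

```shell
# Print "<class> <source>" for a fully qualified class name found in
# -verbose:class output (Java 8 format assumed).
# Usage: class_source <class_name> <out_log_file>
class_source() {
  awk -v cls="$1" '
    $1 == "[Loaded" && $2 == cls {
      src = $4
      sub(/\]$/, "", src)   # drop the closing bracket after the source
      print $2, src
    }' "$2"
}
```

If the same class appears with more than one source across JM and TM logs, the CLASSPATH order differs between the processes, which is usually worth checking first.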

▼ How do I view the complete logs of a Flink application?

The JM/TM logs of a running Flink application can be viewed on the WebUI, but troubleshooting usually requires analyzing the complete logs, so you need to understand how YARN keeps logs. Where YARN keeps Container logs depends on the application state:

1. If the application has not finished, Container logs remain on the node where the Container runs. Even if the Container itself has finished, they can still be found in the configured directory on that node: ${yarn.nodemanager.log-dirs}//, and they can also be accessed directly from the WebUI: http:///node/containerlogs//

2. If the application has finished and log aggregation is enabled on the cluster (yarn.log-aggregation-enable=true), then usually when the application ends (incremental upload can also be configured) the NM uploads all of its logs to distributed storage (usually HDFS) and deletes the local files. All logs of the application can then be viewed with the yarn command: yarn logs -applicationId -appOwner; add the options -containerId -nodeAddress to view the log of a single Container. You can also access the distributed storage directory directly: ${yarn.nodemanager.remote-app-log-dir}/${user}/${yarn.nodemanager.remote-app-log-dir-suffix}/
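
As a small sketch of the two command forms above, the helper below only assembles the yarn logs command line from its arguments so that it can be reviewed before it is run. The function name yarn_logs_cmd is hypothetical, and the application ID, user, Container ID, and node address are supplied by the caller.

```shell
# Build (but do not run) a "yarn logs" command line.
# Usage: yarn_logs_cmd <application_id> <app_owner> [container_id node_address]
yarn_logs_cmd() {
  cmd="yarn logs -applicationId $1 -appOwner $2"
  if [ -n "${3:-}" ]; then
    # Narrow the output to one Container on one node.
    cmd="$cmd -containerId $3 -nodeAddress ${4:-}"
  fi
  echo "$cmd"
}
```

The printed command can be redirected to a file when executed, since aggregated logs of a large application are often too long to read on the console.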

▼ Flink application resource allocation troubleshooting ideas

If a Flink application cannot start normally and reach the RUNNING state, troubleshoot it with the following steps:

1. Check the current state of the application. From the description of the startup process in Part 1, we know:

  • In the NEW_SAVING state, the application's information is being persisted. If the application stays in this state, check whether the RM state store service (usually a ZooKeeper cluster) is healthy;
  • In the SUBMITTED state, some time-consuming operation inside the RM may be holding the write lock and causing events to pile up; investigate further using the YARN cluster logs;
  • In the ACCEPTED state, first check whether the AM is normal: go to step 2;
  • In the RUNNING state, if the job cannot run normally because it has not obtained all of its resources, go to step 3;

2. Check whether the AM is normal. View the diagnostics information on the YARN application page (http:///cluster/app/) or through the YARN application REST API (http:///ws/v1/cluster/apps/), then identify the cause and the solution from the keywords in it:

- Queue's AM resource limit exceeded. The queue has reached its AM resource ceiling: the AM resources already used in the queue plus the AM resources requested by the new application exceed the queue's AM resource limit. You can raise the percentage of the queue's resources available to AMs: yarn.scheduler.capacity..maximum-am-resource-percent.

- User's AM resource limit exceeded. The user the application belongs to has reached that user's AM resource ceiling in the queue: the AM resources the user has already used in the queue plus the AM resources requested by the new application exceed the user's AM resource limit in the queue. Raising the user's share of AM resources appropriately resolves this; the relevant configuration items are yarn.scheduler.capacity..user-limit-factor and yarn.scheduler.capacity..minimum-user-limit-percent.

- AM container is launched, waiting for AM container to Register with RM. Roughly, the AM has been launched but has not completed its internal initialization; there may be ZK connection timeouts or similar problems. The specific cause has to be investigated in the AM log and resolved accordingly.

- Application is Activated, waiting for resources to be assigned for AM. The AM resource checks for the application have passed and it is waiting for the scheduler to assign resources. The investigation now moves to the scheduler level: go to step 4.

3. Confirm whether the application really has resource requests that YARN has not satisfied: on the application list page, click the ID of the problem application to open the application page, then click the application attempt ID in the list below to open the attempt page, and check whether the Total Outstanding Resource Requests list shows Pending resources. If not, YARN has completed the allocation; leave this checking process and go on to check the AM. If so, the scheduler has not completed the allocation: go to step 4;

4. Troubleshoot the scheduler allocation. YARN-9050 adds automatic diagnosis of application problems on the WebUI or through the REST API and will be released in Hadoop 3.3.0; on earlier versions the investigation still has to be done manually:

  • Check the cluster and queue resources. On the scheduler page, expand the tree to the leaf queue and look at the resource information, such as Effective Max Resource and Used Resources: (1) check whether the resources of the cluster, of the queue, or of its parent queues are exhausted; (2) check whether any resource dimension of the leaf queue is approaching or has reached its upper limit;
  • Check for resource fragmentation: (1) check the ratio of the cluster's Used plus Reserved resources to its total resources. When the cluster is nearly full (for example, above 90%), fragmentation may slow down allocation for the application: most machines have no resources left, and machines whose available resources are insufficient get reserved. Once reserved resources reach a certain scale, the resources of most machines may be locked up and subsequent allocation may slow down; (2) check the distribution of available resources across NMs. Even when cluster usage is not high, allocation may fail because the resource dimensions are distributed unevenly, for example half of the nodes have memory close to full with plenty of CPU left while the other half have CPU close to full with plenty of memory left. A request in which one dimension of the configured resource is too large may also be impossible to satisfy;
  • Check whether a high-priority application is frequently requesting resources and releasing them immediately. This keeps the scheduler busy satisfying that one application's requests, leaving it unable to attend to other applications;
  • Check whether Containers fail to start or exit right after starting. Investigate with the Container logs (including the localize log, launch log, and so on), the YARN NM log, or the YARN RM log.
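
The decision flow of steps 1 and 2 above can be sketched as a small shell function that maps an application state, plus the diagnostics text for ACCEPTED, to the next thing to check. The function name triage and its output wording are illustrative only; the state could come, for example, from yarn application -status, and the diagnostics text from the application page.

```shell
# Map a YARN application state (and optional diagnostics text) to the
# next troubleshooting step. Illustrative helper, not a YARN or Flink tool.
# Usage: triage <state> [diagnostics]
triage() {
  state=$1
  diagnostics=${2:-}
  case "$state" in
    NEW_SAVING) echo "check the RM state store (usually ZooKeeper)" ;;
    SUBMITTED)  echo "check the YARN RM logs for event backlog" ;;
    ACCEPTED)
      case "$diagnostics" in
        *"Queue's AM resource limit exceeded"*)
          echo "raise maximum-am-resource-percent of the queue" ;;
        *"User's AM resource limit exceeded"*)
          echo "raise user-limit-factor or minimum-user-limit-percent" ;;
        *"waiting for AM container to Register with RM"*)
          echo "check the AM log" ;;
        *"waiting for resources to be assigned for AM"*)
          echo "step 4: check scheduler allocation" ;;
        *)
          echo "step 2: read the diagnostics of the application" ;;
      esac ;;
    RUNNING)    echo "step 3: check outstanding resource requests" ;;
    *)          echo "unknown state: $state" ;;
  esac
}
```

Keeping the keyword matching in one place makes it easy to extend when other diagnostics messages show up on a particular cluster.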

▼ TaskManager startup exception:
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. This token is expired. current time is ... found ...

This exception is thrown by the YARN NM when the Flink AM asks it to start a Container whose token has expired. The usual cause is that the Flink AM waited too long after receiving the Container from the YARN RM before starting it (longer than the Container's validity period, 10 minutes by default, after which the Container has been released). The deeper cause is that Flink starts the Containers it receives from the YARN RM serially.

When many Containers are waiting to be started and the distributed file storage (such as HDFS) is slow (the TaskManager configuration has to be uploaded before the start), Container start requests easily pile up inside the AM. FLINK-13184 optimizes this in two ways: first, a validity check is added before the start, to avoid a pointless configuration upload; second, the start is made asynchronous and multi-threaded to speed it up.
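
The timing condition behind this exception is simple arithmetic. The sketch below assumes timestamps in epoch seconds and uses the 10 minute default mentioned above as the validity period; the function name container_token_expired is hypothetical.

```shell
# Return success (0) if a Container received at <allocated_s> would already be
# expired at <now_s>. The validity defaults to 600 seconds (10 minutes).
# Usage: container_token_expired <allocated_s> <now_s> [validity_s]
container_token_expired() {
  [ $(( $2 - $1 )) -gt "${3:-600}" ]
}
```

Comparing the time the AM log reports receiving a Container with the time of the failed start request shows how far the start queue has fallen behind.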

▼ Failover exception 1:
java.util.concurrent.TimeoutException: Slot allocation request timed out for ...

This exception means that the resources requested for a TaskManager could not be allocated normally. Troubleshoot it by following step 4 of the Flink application resource allocation troubleshooting ideas above.

▼ Failover exception 2:
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id timed out.

The direct cause of this exception is a TaskManager heartbeat timeout. The underlying cause may be one of the following:

  • The process has exited, either because of an error of its own or because it was affected by the preemption mechanism of the YARN RM or NM; trace this further in the TaskManager log or the YARN RM/NM logs;
  • The process is still running but has lost contact because of a cluster network problem. It exits on its own after the connection times out, and the JobManager recovers through failover after the exception (requesting resources again and starting a new TaskManager);
  • GC pauses of the process are too long, possibly caused by a memory leak or an unreasonable memory configuration; locate the specific cause through log or memory analysis.

▼ Failover exception 3:
java.lang.Exception: Container released on a lost node

This exception means that the node on which the Container runs has been marked LOST in the YARN cluster. The YARN RM actively releases all Containers on that node and notifies the AM. After receiving this exception, the JobManager recovers through failover (requesting resources again and starting a new TaskManager), and the leftover TaskManager process exits on its own after a timeout.

▼ Troubleshooting ideas for difficult Flink Cluster problems

First analyze the JobManager/TaskManager logs to locate the problem; for how to obtain the complete logs, see the earlier question on viewing the complete logs of a Flink application. To get DEBUG information, modify the JobManager/TaskManager log4j configuration (${FLINK_HOME}/conf/log4j.properties) and resubmit the application. For a process that is still running, the Java bytecode tool Byteman is recommended for looking into the internal state of the process; for details, see: How Do I Install the Agent Into a Running Program?

Reference material

The documents and issues referred to in this article are listed below:

Byteman Documents

Maven Shade Plugin

YARN-9050

FLINK-13184

How Do I Install The Agent Into A Running Program?

The two Flink on YARN articles have walked through the whole startup process of a Flink on YARN application and provided troubleshooting ideas for common client and Flink Cluster problems, for your reference. We hope they help you in practice.



Author: Basu live

Original link: https://yq.aliyun.com/articles/719703?utm_content=g_1000079636 

This article is original content of the Yunqi Community and may not be reproduced without permission.


Origin www.cnblogs.com/bokeyuanxiao/p/11672167.html