View Spark task details

When learning Spark, viewing the details of a job, such as its DAG, stages, and tasks, is an important learning aid; that is what this article summarizes.

Environment information

The environment used in this article is as follows:

  1. CentOS Linux release 7.5.1804
  2. JDK: 1.8.0_191
  3. Hadoop: 2.7.7
  4. Spark: 2.3.2

Reference articles (to get the environment ready)

To set up the Hadoop and Spark clusters and to build the time-consuming compute job used below, please refer to the following articles:

  1. Deploying Hadoop: "Linux cluster deployment of hadoop2.7.7";
  2. Deploying a Spark cluster in on-Yarn mode: "Deploying a Spark2.2 cluster (on Yarn mode)";
  3. Developing a time-consuming compute job: "Spark in action: analyzing Wikipedia site statistics (Java version)";

With the above preparation complete, we have a usable Spark cluster environment and a compute job ready to run.

Observing runtime job information

  1. Start a Spark job, for example by executing the following command:
~/spark-2.3.2-bin-hadoop2.7/bin/spark-submit \
--class com.bolingcavalry.sparkdemo.app.WikiRank \
--executor-memory 2g \
--total-executor-cores 4 \
/home/hadoop/jars/sparkdemo-1.0-SNAPSHOT.jar \
192.168.121.150 \
8020

The console then prints a line like the following:

2019-10-07 11:03:54 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://node0:4040
  2. While the job is running, visit port 4040 of the master machine in a browser; the web UI visualizes the job's events, DAG, and stages.
  3. Click a stage's "Description" link in the page above to see all the task information for that stage.
  4. After the job completes, the console prints the following output, and revisiting the web UI on port 4040 finds that it can no longer be reached:
2019-10-07 11:45:29 INFO  SparkUI:54 - Stopped Spark web UI at http://node0:4040
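
Incidentally, while a job is still running, everything shown on port 4040 is also available over Spark's monitoring REST API, served on the same port as the web UI. A minimal sketch, assuming the driver runs on node0 and substituting a real application ID taken from the first call:

# list running applications as JSON (the response includes each application's ID)
curl http://node0:4040/api/v1/applications
# show stage details for one application (replace <app-id> with a real ID)
curl http://node0:4040/api/v1/applications/<app-id>/stages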

Observing historical job information

After a job ends, the web UI service on port 4040 stops as well; to look back at information about finished jobs, you need to configure and start the history server:

  1. Open the configuration file spark-2.3.2-bin-hadoop2.7/conf/spark-defaults.conf and add the following three settings:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://node0:8020/var/log/spark
spark.eventLog.compress true

In the configuration above, hdfs://node0:8020 is the HDFS service address.
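
If you prefer not to edit the file, the same event-log settings can also be passed per job via spark-submit's --conf option; a sketch assuming the same HDFS address:

# per-job equivalent of the spark-defaults.conf settings above
~/spark-2.3.2-bin-hadoop2.7/bin/spark-submit \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=hdfs://node0:8020/var/log/spark \
--conf spark.eventLog.compress=true \
... # remaining class, memory, and jar arguments as in the earlier command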

  2. Open the configuration file spark-env.sh (in the same conf directory) and add the following line, which sets the history server's web UI port, the number of applications whose UI data is retained in memory, and the HDFS directory to read event logs from:
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://node0:8020/var/log/spark"

Again, hdfs://node0:8020 is the HDFS service address; spark.history.fs.logDirectory must point at the same directory as spark.eventLog.dir above.

  3. On the HDFS namenode, execute the following command to create the log directory in advance:
~/hadoop-2.7.7/bin/hdfs dfs -mkdir -p /var/log/spark
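
Optionally, verify the directory was created, assuming the same path:

# should succeed and, until a job runs with event logging enabled, list nothing
~/hadoop-2.7.7/bin/hdfs dfs -ls /var/log/spark
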
  4. Start the history server:
~/spark-2.3.2-bin-hadoop2.7/sbin/start-history-server.sh
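
To confirm it started, a HistoryServer process should appear in jps, and its startup messages (including the bound port) go to the Spark logs directory; a quick check, assuming a default installation:

# the JVM process list should now include "HistoryServer"
jps
# inspect the startup log written by start-history-server.sh
tail ~/spark-2.3.2-bin-hadoop2.7/logs/*HistoryServer*.out
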
  5. From now on, Spark job execution information is saved. Visit port 18080 of the master machine to see all historical jobs; click one for its details, which are the same as what the web UI showed while the job was running.
At this point we can inspect job details both at runtime and after the fact, which helps us learn and study Spark more effectively.
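
The history server exposes the same REST API as the live web UI, just on its own port, so finished jobs can be inspected from the command line as well; a sketch assuming the history server runs on node0:

# list the completed applications recorded in the event logs
curl http://node0:18080/api/v1/applications
# stage details of one completed application (replace <app-id> with a real ID)
curl http://node0:18080/api/v1/applications/<app-id>/stages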

Origin: blog.csdn.net/boling_cavalry/article/details/102291920