How to Stop a Spark Streaming Job Gracefully

  A streaming program, once up and running, is essentially in an endless loop and will not stop except under special circumstances. Since it may be processing data at any moment, stopping it requires making sure that the data currently being processed has finished and that no new data is accepted, so that data is neither lost nor processed twice.

  For the same reason, a streaming program cannot simply be stopped with a brute-force kill -9: killing it directly may lose data or consume the same data repeatedly.

  

  Here is how to stop a streaming job gracefully.

 

  The first way: stop manually

  •   Set the following parameter in the program (a fuller sketch is given below):
sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true") // graceful shutdown
  •   Then follow these steps:
    • Find the running application on the Hadoop YARN page (port 8088)
    • Open the Spark UI monitoring page
    • Open the executor monitoring page
    • Log in to the Linux machine where the driver node runs and find its IP and the port it listens on
    • Then execute a wrapped command such as:
      sudo ss -tanlp | grep 5555 | awk '{print $6}' | awk -F, '{print $2}' | sudo xargs kill -15

    This approach is obviously rather cumbersome.
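
    A minimal sketch of where this parameter fits into a streaming program (the app name and batch interval here are just placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf()
      .setAppName("StreamingGracefulStop") // illustrative app name
      // let a SIGTERM (kill -15) to the driver finish in-flight batches before exiting
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(sparkConf, Seconds(5)) // illustrative batch interval
    // ... define the streaming computation here ...
    ssc.start()
    ssc.awaitTermination()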

  The second way: use HDFS for message notification

    In the driver, add a piece of code that scans HDFS for a marker file every so often. If the file is found, the code calls the stop method of StreamingContext to stop the program gracefully.

    HDFS can be replaced with Redis, ZooKeeper, HBase, a database, and so on; the only drawback is that it depends on an external storage system to deliver the notification.

    Stopping the program this way is relatively simple: log in to a machine with an HDFS client and touch an empty file into the specified directory (for example, hadoop fs -touchz /stopMarker/marker). Once the next scan interval elapses and the file is detected, the program shuts itself down.

    Without further ado, here is the code:

    import java.net.URI

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    ssc.start()

    // how often to check for the stop marker, in milliseconds
    val checkIntervalMillis = 15000
    var isStopped = false
    var stopFlag = false

    println("before while")
    while (!isStopped) {
      println("calling awaitTerminationOrTimeout")
      // returns true if the streaming context has terminated, false after the timeout
      isStopped = ssc.awaitTerminationOrTimeout(checkIntervalMillis)
      if (isStopped)
        println("confirmed! The streaming context is stopped. Exiting application...")
      else
        println("Streaming App is still running.")

      println("check file exists")
      if (!stopFlag) {
        val fs = FileSystem.get(new URI("hdfs://192.168.156.111:9000"), new Configuration())
        stopFlag = fs.exists(new Path("/stopMarker/marker"))
      }
      if (!isStopped && stopFlag) {
        println("stopping ssc right now")
        // stop the SparkContext as well, and let in-flight batches finish first
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }

  The third way: expose an internal socket or HTTP port in the program to receive stop requests

    This requires starting a socket thread or an HTTP service in the driver. The HTTP service is the more recommended of the two, because raw sockets are fairly low-level and slightly more complex to handle.

    If you go with the HTTP service, you can embed Jetty directly and expose an HTTP endpoint. The Spark UI itself is served by an embedded Jetty, so there is no need to introduce extra dependencies in the pom file. To shut the job down, find the IP where the driver is running and stop the streaming program directly from a browser or with curl.

    The driver's IP can be found in the program's logs or on the Spark master UI. This approach does not rely on any external storage system; the only extra requirement at deployment time is exposing one additional HTTP port for the service.
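
    A minimal sketch of the idea, written against the plain org.eclipse.jetty API (the Jetty classes bundled with Spark are shaded under a Spark-specific package, so the imports or an explicit Jetty dependency may need adjusting for your Spark version; the /stop path and the port are illustrative):

    import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

    import org.apache.spark.streaming.StreamingContext
    import org.eclipse.jetty.server.{Request, Server}
    import org.eclipse.jetty.server.handler.AbstractHandler

    // Start a tiny HTTP server inside the driver; GET /stop triggers a graceful shutdown.
    def startStopServer(ssc: StreamingContext, port: Int): Server = {
      val server = new Server(port)
      server.setHandler(new AbstractHandler {
        override def handle(target: String, baseRequest: Request,
                            request: HttpServletRequest, response: HttpServletResponse): Unit = {
          if (target == "/stop") {
            response.setStatus(HttpServletResponse.SC_OK)
            response.getWriter.println("stopping streaming context ...")
            baseRequest.setHandled(true)
            // stop in a separate thread so the HTTP response is not blocked by the shutdown
            new Thread(new Runnable {
              override def run(): Unit = ssc.stop(stopSparkContext = true, stopGracefully = true)
            }).start()
          }
        }
      })
      server.start()
      server
    }

    Call startStopServer(ssc, 12345) right after ssc.start(); to shut the job down, run curl http://<driver-ip>:12345/stop from any machine that can reach the driver (12345 is just an example port).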

 

We recommend the second or the third way; if you want to minimize dependence on external systems, the third is the better choice.

 

Reference: https://www.linkedin.com/pulse/how-shutdown-spark-streaming-job-gracefully-lan-jiang

 
