Summary of Troubleshooting Causes for Abnormal Restart of Storm Worker

At the moment I'm just waiting for 6.18 to arrive, so I'll write up a blog post while I have nothing else to do... cool.

 

A Storm cluster automatically starts a new worker after a worker goes down, but in many cases workers restart when there seems to be no reason for them to, so I set out to track down why the workers kept restarting~

 

1. Investigation approach

After investigating, the following problems turned out to be the main causes of worker restarts:

1. The code has an uncaught exception

In the following example, the data being processed is malformed and the resulting exception is not caught in the code, so it propagates up to the JVM and the worker process dies.

For this kind of exception, the corresponding error is visible in the Storm UI. So when troubleshooting, the first thing to check is whether the UI shows a thrown exception.

 

java.lang.RuntimeException: java.lang.NumberFormatException: For input string: "赠品"
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:90)
    at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:61)
    at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:62)
    at backtype.storm.daemon.executor$fn__3498$fn__3510$fn__3557.invoke(executor.clj:730)
    at backtype.storm.util$async_loop$fn__444.invoke(util.clj:403)
    at clojure.lang.AFn.run(AFn.java:24)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NumberFormatException: For input string: "赠品"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:410)
    at java.lang.Long.parseLong(Long.java:468)
    at com.jd.ad.user.service.impl.UserOrderEntireUpdateServiceImpl.processUserOrderEntireData(UserOrderEntireUpdateServiceImpl.java:96)
    at com.jd.a
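
Below is a minimal sketch of how such an exception can be contained inside the bolt instead of killing the worker. It is illustrative only: the bolt class, the field names ("orderId", "quantity") and the error handling are my assumptions, not the actual code behind the stack trace above.

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Illustrative bolt that parses an order field which is expected to be numeric
// but occasionally contains text such as "赠品" (a free-gift marker).
public class OrderParseBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String rawQuantity = input.getStringByField("quantity");
        try {
            long quantity = Long.parseLong(rawQuantity);
            collector.emit(new Values(input.getStringByField("orderId"), quantity));
        } catch (NumberFormatException e) {
            // Without this catch the exception escapes execute(), kills the
            // executor thread and takes the whole worker process down with it.
            // Here the bad record is simply logged and skipped.
            System.err.println("Skipping non-numeric quantity: " + rawQuantity);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("orderId", "quantity"));
    }
}

Whether to drop the bad record, route it to a separate error stream, or fail the tuple depends on the topology's semantics; the point is simply that the exception should not be allowed to escape execute().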

 

2. JVM out of memory

For this issue I did not keep the original error output. In short, various problems put pressure on the JVM's garbage collection and eventually lead to an out-of-memory error, which also causes the worker to exit.

As in case 1, this kind of exception can also be seen in the Storm UI. The actual cause of the out-of-memory error then has to be investigated case by case.
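
One way to make this kind of failure easier to analyze is to pass extra JVM options to the topology's workers so a heap dump is written when memory runs out. The sketch below assumes the 0.9.x-era backtype.storm API that appears in the stack traces above; the heap size, dump path, worker count and topology name are placeholders, not our production values.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class SubmitWithHeapDump {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... wire up spouts and bolts here ...

        Config conf = new Config();
        conf.setNumWorkers(4);
        // Per-topology worker JVM options: cap the heap and write a heap dump
        // on OutOfMemoryError so the overflow can be analyzed offline, e.g.
        // with Eclipse MAT.
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
                "-Xmx2g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/storm-worker.hprof");

        StormSubmitter.submitTopology("data_process_1", conf, builder.createTopology());
    }
}

The same flags can also be applied cluster-wide through worker.childopts in storm.yaml instead of per topology.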

 

3. Nothing is wrong with the worker itself, but the supervisor restarts it

For problems 1 and 2, the thrown exception can be found both in the Storm UI and in the worker's log file. However, we also ran into another situation: no exception could be found anywhere in the worker, yet workers still restarted at seemingly random times.

 

Since no exception information can be found in the worker, the next place to look is the supervisor's log: the supervisor monitors the state of the workers on its local machine, and it is the supervisor that actually restarts a worker.

 

Looking at the supervisor's log at this point, you can see the following:

2017-06-17 23:36:08 b.s.d.supervisor [INFO] Shutting down and clearing state for id 867ed61b-a9d5-423e-bb0b-b2e428369140.
Current supervisor time: 1497713767. State: :timed-out,
Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1497713735, :storm-id "data_process_1-170-1497518898",
:executors #{[578 578] [868 868] [1158 1158] [1448 1448] [1738 1738] [-1 -1] [288 288]}, :port 6716}

From this it can be judged that the supervisor timed out while obtaining the worker's heartbeat state (to be honest, I am not completely sure where exactly the supervisor reads this state from, since the worker's heartbeat is written to the local file system; perhaps it is the executors' state? Corrections in the comments are welcome), so it shut the worker down and then started a new one. Checking the machines at that time, the CPU load on the ZooKeeper machines occasionally became unstable, as shown below:

[Chart of intermittent CPU load spikes on the ZooKeeper machines; image not reproduced]

Therefore, it can be concluded that the supervisor timed out while obtaining the worker's heartbeat state: in the log above, the last heartbeat (1497713735) is 32 seconds older than the supervisor's current time (1497713767), which already exceeds Storm's default 30-second worker timeout, so the supervisor shut the worker down and restarted it.
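
For reference, the threshold behind the :timed-out state is the supervisor-side heartbeat timeout. A minimal sketch of the relevant storm.yaml settings, shown here with their documented defaults (illustrative, not our production values):

# storm.yaml on the supervisor nodes (defaults shown for illustration)
supervisor.worker.timeout.secs: 30        # heartbeats older than this mark the worker :timed-out
supervisor.worker.start.timeout.secs: 120 # extra grace period while a worker is still starting up

Raising supervisor.worker.timeout.secs can paper over short stalls, but in our case the root cause was the machine itself, as described next.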

 

After talking with the operations team, it turned out that the three ZooKeeper nodes were deployed on the same machines as the supervisors, which is why those machines became unstable.

 

Now that 6.18 has passed smoothly, I'm ready to head home.

 
