org.apache.flink.util.FlinkException: The assigned slot container_16xxx was removed

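A Flink job consuming from Kafka had not consumed any data for a long time, and the consumer lag was not changing. The TaskManager log showed the following: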

org.apache.flink.util.FlinkException: The assigned slot container_1603071676168_0084_01_000009_0 was removed.
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:899)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:869)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1090)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:391)
	at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:845)
	at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:362)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
	at akka.actor.Actor.aroundReceive(Actor.scala:502)
	at akka.actor.Actor.aroundReceive$(Actor.scala:500)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
2020-10-19 11:28:26,781 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job SuperID_Stream (7cbca2569ff29f292c9ccabc80af5480) switched from state RUNNING to FAILING.

Looking at the surrounding log context turned up the following:

2020-10-19 11:26:53,303 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job SuperID_Stream (7cbca2569ff29f292c9ccabc80af5480) switched from state RUNNING to FAILING.
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
	at org.apache.kafka.common.memory.MemoryPool$1.tryAllocate(MemoryPool.java:30)
	at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:140)
	at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
	at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:235)
	at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:196)
	at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:557)
	at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:495)
	at org.apache.kafka.common.network.Selector.poll(Selector.java:424)
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:460)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:261)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1171)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)
	at org.apache.flink.streaming.connectors.kafka.internal.KafkaConsumerThread.run(KafkaConsumerThread.java:253)

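Putting the OOM and the removed container together, the analysis is as follows.

Possible causes:

  • 1. The TaskManager used too much memory and hit an OOM, so YARN killed its container.
  • 2. The cluster does not have enough resources.
  • 3. The job may have a memory leak.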
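
To tie the two failures together, it helps to check that the OOM-related entry really precedes the entry next to the slot removal. Below is a minimal sketch in plain Java (no Flink dependency) that parses the two timestamps quoted in the logs above and prints the gap. The class name LogGap and its methods are hypothetical helpers, and the log lines are shortened copies of the ones above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogGap {
    // The log lines above start with a timestamp such as "2020-10-19 11:26:53,303".
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Parses the 23-character timestamp at the start of a log line. */
    public static LocalDateTime timestampOf(String logLine) {
        return LocalDateTime.parse(logLine.substring(0, 23), TS);
    }

    /** Milliseconds from the first line to the second (negative if the order is reversed). */
    public static long gapMillis(String earlier, String later) {
        return Duration.between(timestampOf(earlier), timestampOf(later)).toMillis();
    }

    public static void main(String[] args) {
        // Shortened copies of the two "switched from state RUNNING to FAILING" entries.
        String oomEntry = "2020-10-19 11:26:53,303 INFO  ExecutionGraph - Job SuperID_Stream";
        String slotEntry = "2020-10-19 11:28:26,781 INFO  ExecutionGraph - Job SuperID_Stream";
        System.out.println(gapMillis(oomEntry, slotEntry) + " ms"); // prints "93478 ms"
    }
}
```

A positive gap of roughly a minute and a half is consistent with the first cause in the list: the OOM shows up first, and the container is removed afterwards.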

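Solutions:

  • 1. Schedule this Flink application on a cluster that provides more memory per slot.
  • 2. Use slotSharingGroup("xxx") to reduce the number of tasks that share a slot.
  • 3. Review the job code to check whether the problem is in the code itself.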
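
For the third item, one low-cost way to look for a leak is to log heap usage periodically from inside the job and watch whether it keeps climbing. The following is a minimal sketch using only the JDK's MemoryMXBean. The class HeapProbe and its methods are hypothetical names, and where to call it from (for example, a user function's open() or a timer) is an assumption left to the job author.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapProbe {
    /**
     * Returns used heap divided by max heap as a fraction in [0, 1],
     * or -1.0 when the JVM does not report a maximum heap size.
     */
    public static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 means the maximum is undefined
        if (max <= 0) {
            return -1.0;
        }
        return (double) heap.getUsed() / max;
    }

    /** Formats the current heap usage as a short line suitable for a log message. */
    public static String describe() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return "heap used=" + (heap.getUsed() >> 20) + " MiB, max=" + (heap.getMax() >> 20) + " MiB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
        System.out.println("fraction=" + heapUsedFraction());
    }
}
```

If the fraction stays close to 1 even after garbage collection, that matches the "GC overhead limit exceeded" error above and points to either a leak or a heap that is too small for the tasks sharing the slot.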

Related links:

https://www.jianshu.com/p/a2302724e6d6

https://blog.csdn.net/weixin_43655417/article/details/96970131

https://blog.csdn.net/u010942041/article/details/103731168

https://stackoverflow.com/questions/54095067/flink-slot-removed-exception

Reposted from blog.csdn.net/qq_43081842/article/details/109163449