Flink: "Lost connection to task manager xxx. This indicates that the remote task manager was lost"

1. Cover image

[Image not available]

2. Background

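A program that had been running fine was written for Flink 1.9. When I upgraded it to Flink 1.10, performance dropped by 20%. See: 16-flink 1.9 vs. flink 1.10 performance test comparison (local)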

I was then told to add one parameter to the Flink 1.10 configuration:

taskmanager.memory.managed.fraction: 0

An expert explained:


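Judging by this job's 1.9 configuration, it probably does not use RocksDB. In that case, managed memory in 1.9 was on-heap and non-preallocated by default, so any managed memory nobody used was simply left to the heap. Now that managed memory in 1.10 is entirely off-heap, this job probably cannot make use of that part of the memory at all.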
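To make the expert's point concrete, here is a rough sketch of how much memory moves between managed (off-heap) memory and task heap. The component sizes are copied from the -D options on this job's TaskManager command line, which appears in the YARN process dump later in this post. The fraction 0.4 is the Flink 1.10 default for taskmanager.memory.managed.fraction. The function name and the assumption that every other component keeps its logged size are mine, so treat the result as an estimate and not as Flink's exact calculation.

```python
# Component sizes in bytes, copied from the -D options on the TaskManager
# command line of this job (started with taskmanager.memory.managed.fraction: 0).
FRAMEWORK_HEAP = 134217728
FRAMEWORK_OFF_HEAP = 134217728
NETWORK = 183207201
TASK_OFF_HEAP = 0
TASK_HEAP_AT_FRACTION_0 = 1380429327

# With managed memory at 0, total Flink memory is the sum of the other parts.
TOTAL_FLINK_MEMORY = (FRAMEWORK_HEAP + FRAMEWORK_OFF_HEAP + NETWORK
                      + TASK_OFF_HEAP + TASK_HEAP_AT_FRACTION_0)


def split_for_fraction(managed_fraction):
    """Return (managed_bytes, task_heap_bytes) for a managed fraction.

    Simplifying assumption (not Flink's exact algorithm): every component
    except managed memory and task heap keeps the size seen in the log.
    """
    managed = int(TOTAL_FLINK_MEMORY * managed_fraction)
    task_heap = TASK_HEAP_AT_FRACTION_0 - managed
    return managed, task_heap


MIB = 1024 * 1024
for fraction in (0.4, 0.0):
    managed, task_heap = split_for_fraction(fraction)
    print(f"fraction={fraction}: managed={managed / MIB:.0f} MiB, "
          f"task heap={task_heap / MIB:.0f} MiB")
```

Under these assumptions, the default fraction sets aside roughly 700 MiB as off-heap managed memory, which a job without RocksDB would leave idle. Setting the fraction to 0 hands that memory back to the task heap.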

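I then restarted the job. In the web UI the tasks went from green to blue, and finally to yellow and red.

[Screenshots not available]
After a while they turned green again.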

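Looking at the containers, one of them disappears periodically.

[Screenshot not available]
One has disappeared:
[Screenshot not available]
After a while it recovers, and the page turns green again.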

The log of the failing container is as follows:

2020-04-09 19:16:46,439 INFO  org.apache.flink.runtime.taskmanager.Task                     - Aviator Rule Engine (3/12) (735c1bdbf7b5de50d08aa180054ed69a) switched from RUNNING to FAILED.
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '1.datanode3/72.118.0.15:46791'. This might indicate that the remote task manager was lost.
	at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
	at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:393)
	at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:358)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
	at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.lang.Thread.run(Thread.java:745)
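The exception only says that the peer went away, so the first step is to find out which TaskManager was lost and then look at that container's own log. The helper below pulls the host, IP, and port out of such a line. The regular expression is inferred from the single message shown above, not taken from Flink's source, and the function name is hypothetical.

```python
import re

# Pattern inferred from the one log line in this post; other Flink versions
# may word the message differently.
REMOTE_TM = re.compile(
    r"closed by remote task manager '(?P<host>[^/']+)/(?P<ip>[\d.]+):(?P<port>\d+)'"
)


def lost_task_manager(line):
    """Return (host, ip, port) of the lost remote TaskManager, or None."""
    match = REMOTE_TM.search(line)
    if match is None:
        return None
    return match.group("host"), match.group("ip"), int(match.group("port"))


sample = (
    "org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: "
    "Connection unexpectedly closed by remote task manager "
    "'1.datanode3/72.118.0.15:46791'. This might indicate that the remote "
    "task manager was lost."
)
print(lost_task_manager(sample))  # ('1.datanode3', '72.118.0.15', 46791)
```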

The web UI also shows the error:

[Screenshot not available]
The log in YARN makes it look like a resource shortage:

2020-04-09 19:07:23,100 INFO  org.apache.flink.yarn.YarnResourceManager                     - Closing TaskExecutor connection container_1585309476321_0023_01_000002 because: Container [pid=3410,containerID=container_1585309476321_0023_01_000002] is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 3.9 GB of 4.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_1585309476321_0023_01_000002 :
        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
        |- 3410 3405 3410 3410 (bash) 5 5 108703744 631 /bin/bash -c /usr/jdk64/jdk1.8.0_112/bin/java -Xmx1514647055 -Xms1514647055 -XX:MaxDirectMemorySize=317424929 -XX:MaxMetaspaceSize=100663296 -Dlog.file=/data/hadoop/yarn/log/application_1585309476321_0023/container_1585309476321_0023_01_000002/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=183207201b -D taskmanager.memory.network.min=183207201b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=0b -D taskmanager.cpu.cores=4.0 -D taskmanager.memory.task.heap.size=1380429327b -D taskmanager.memory.task.off-heap.size=0b --configDir . -Dweb.port='0' -Djobmanager.rpc.address='1.datanode2' -Dweb.tmpdir='/tmp/flink-web-4f0ac083-3f1e-4cf7-8c5f-eba8bc140b42' -Djobmanager.rpc.port='38463' -Drest.address='1.datanode2' 1> /data/hadoop/yarn/log/application_1585309476321_0023/container_1585309476321_0023_01_000002/taskmanager.out 2> /data/hadoop/yarn/log/application_1585309476321_0023/container_1585309476321_0023_01_000002/taskmanager.err
        |- 3669 3410 3410 3410 (java) 125770 4562 4080103424 525294 /usr/jdk64/jdk1.8.0_112/bin/java -Xmx1514647055 -Xms1514647055 -XX:MaxDirectMemorySize=317424929 -XX:MaxMetaspaceSize=100663296 -Dlog.file=/data/hadoop/yarn/log/application_1585309476321_0023/container_1585309476321_0023_01_000002/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=183207201b -D taskmanager.memory.network.min=183207201b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=0b -D taskmanager.cpu.cores=4.0 -D taskmanager.memory.task.heap.size=1380429327b -D taskmanager.memory.task.off-heap.size=0b --configDir . -Dweb.port=0 -Djobmanager.rpc.address=1.datanode2 -Dweb.tmpdir=/tmp/flink-web-4f0ac083-3f1e-4cf7-8c5f-eba8bc140b42 -Djobmanager.rpc.port=38463 -Drest.address=1.datanode2

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
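The numbers in this dump can be checked with a little arithmetic. The sketch below uses only values copied from the log above, plus two assumptions of mine: that "2 GB" means 2 GiB, and that the page size is 4096 bytes. It checks that the JVM flags are the sums of the Flink memory components, works out how much of the container is left for everything outside heap, direct memory, and metaspace, and converts the reported RSS from pages to bytes.

```python
# Values copied from the TaskManager command line and process tree above.
XMX = 1514647055                # -Xmx
MAX_DIRECT = 317424929          # -XX:MaxDirectMemorySize
MAX_METASPACE = 100663296       # -XX:MaxMetaspaceSize
FRAMEWORK_HEAP = 134217728      # taskmanager.memory.framework.heap.size
TASK_HEAP = 1380429327          # taskmanager.memory.task.heap.size
FRAMEWORK_OFF_HEAP = 134217728  # taskmanager.memory.framework.off-heap.size
TASK_OFF_HEAP = 0               # taskmanager.memory.task.off-heap.size
NETWORK = 183207201             # taskmanager.memory.network.min/max
RSS_PAGES = 525294              # RSSMEM_USAGE(PAGES) of the java process

CONTAINER_LIMIT = 2 * 1024**3   # assumption: "2 GB physical memory" = 2 GiB
PAGE_SIZE = 4096                # assumption: 4 KiB memory pages

# The heap and direct-memory flags are the sums of the Flink components.
heap_matches = XMX == FRAMEWORK_HEAP + TASK_HEAP
direct_matches = MAX_DIRECT == FRAMEWORK_OFF_HEAP + TASK_OFF_HEAP + NETWORK

# What remains of the container once heap, direct memory and metaspace
# are used up to their caps.
jvm_caps = XMX + MAX_DIRECT + MAX_METASPACE
headroom = CONTAINER_LIMIT - jvm_caps

# Resident memory actually reported for the java process.
rss_bytes = RSS_PAGES * PAGE_SIZE

MIB = 1024 * 1024
print("heap flag matches components:  ", heap_matches)
print("direct flag matches components:", direct_matches)
print(f"headroom outside JVM caps: {headroom / MIB:.1f} MiB")
print(f"RSS over container limit:  {(rss_bytes - CONTAINER_LIMIT) / MIB:.1f} MiB")
```

With managed memory at 0, nearly the whole container is promised to the heap. Only about 205 MiB is left for thread stacks, the code cache, GC structures, and other native allocations. Under the page-size assumption, the reported RSS is a few MiB over the limit, which is consistent with YARN killing the container.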


Reposted from blog.csdn.net/qq_21383435/article/details/105418013