Flink on YARN error: Container released on a *lost* node

A Flink job submitted to YARN had been running for several days when it failed with the following error:

2022-01-05 15:09:26,288 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 89574 for job cc0abb4a3cd870b2a9e1abc7235ceb91 (3528 bytes in 610 ms).
2022-01-05 15:09:29,544 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@prod-bigdata-pc3:42636] has failed, address is now gated for [50] ms. Reason: [Disassociated] 
2022-01-05 15:09:30,678 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 89575 (type=CHECKPOINT) @ 1641366570678 for job cc0abb4a3cd870b2a9e1abc7235ceb91.
2022-01-05 15:09:30,729 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [null] failed with java.net.ConnectException: 拒绝连接: prod-bigdata-pc3/10.5.2.133:42636
2022-01-05 15:09:30,729 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@prod-bigdata-pc3:42636] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@prod-bigdata-pc3:42636]] Caused by: [java.net.ConnectException: 拒绝连接: prod-bigdata-pc3/10.5.2.133:42636]
2022-01-05 15:09:31,482 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Closing TaskExecutor connection container_e27_1640598151061_2774_01_000002 because: Container released on a *lost* node
2022-01-05 15:09:31,495 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [null] failed with java.net.ConnectException: 拒绝连接: prod-bigdata-pc3/10.5.2.133:42636
2022-01-05 15:09:31,496 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@prod-bigdata-pc3:42636] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@prod-bigdata-pc3:42636]] Caused by: [java.net.ConnectException: 拒绝连接: prod-bigdata-pc3/10.5.2.133:42636]
2022-01-05 15:09:31,492 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: Custom Source -> (Filter -> Map -> Sink: Unnamed, Filter -> Map -> Filter -> Map -> Sink: Unnamed) (1/1) (7d96c28b36c1b80514f188d59a885ca4) switched from RUNNING to FAILED on container_e27_1640598151061_2774_01_000002 @ prod-bigdata-pc3 (dataPort=44807).
java.lang.Exception: Container released on a *lost* node
	at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:370) ~[flink-dist_2.12-1.11.3.jar:1.11.3]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [akka-actor_2.11-2.5.21.jar:2.5.21]
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [akka-actor_2.11-2.5.21.jar:2.5.21]
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [scala-library-2.11.12.jar:?]
	at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [akka-actor_2.11-2.5.21.jar:2.5.21]
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [scala-library-2.11.12.jar:?]
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [scala-library-2.11.12.jar:?]
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [scala-library-2.11.12.jar:?]
	at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [akka-actor_2.11-2.5.21.jar:2.5.21]
	at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [akka-actor_2.11-2.5.21.jar:2.5.21]
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [akka-actor_2.11-2.5.21.jar:2.5.21]
	at akka.actor.ActorCell.invoke(ActorCell.scala:561) [akka-actor_2.11-2.5.21.jar:2.5.21]
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [akka-actor_2.11-2.5.21.jar:2.5.21]
	at akka.dispatch.Mailbox.run(Mailbox.scala:225) [akka-actor_2.11-2.5.21.jar:2.5.21]
	at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [akka-actor_2.11-2.5.21.jar:2.5.21]
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [akka-actor_2.11-2.5.21.jar:2.5.21]
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [akka-actor_2.11-2.5.21.jar:2.5.21]
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [akka-actor_2.11-2.5.21.jar:2.5.21]
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [akka-actor_2.11-2.5.21.jar:2.5.21]
2022-01-05 15:09:31,533 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task 4081cf0163fcce7fe6af0cf07ad2d43c_0.
2022-01-05 15:09:31,539 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 1 tasks should be restarted to recover the failed task 4081cf0163fcce7fe6af0cf07ad2d43c_0. 
2022-01-05 15:09:31,547 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job MainKafka2CK (cc0abb4a3cd870b2a9e1abc7235ceb91) switched from state RUNNING to RESTARTING.
2022-01-05 15:09:31,556 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution 7d96c28b36c1b80514f188d59a885ca4.
2022-01-05 15:09:31,883 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: Custom Source -> (Filter -> Map -> Sink: Unnamed, Filter -> Map -> Filter -> Map -> Sink: Unnamed) (1/1) (d556e4eb39bea674f6e10f51b009a535) switched from RUNNING to FAILED on container_e27_1640598151061_2774_01_000002 @ prod-bigdata-pc3 (dataPort=44807).
java.lang.Exception: Container released on a *lost* node
	at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:370) ~[flink-dist_2.12-1.11.3.jar:1.11.3]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]

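The most likely cause is that one of the NodeManager nodes ran short of resources: high CPU or disk usage on that node can lead to it being marked as "lost". To find the exact cause, check the NodeManager log on the node that was marked lost (prod-bigdata-pc3 in the log above). A NodeManager that has been marked lost usually needs to be restarted to recover.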
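
To know which node's NodeManager log to check, it helps to pull the affected host out of the JobManager log first. The following is a minimal sketch, not part of the original troubleshooting steps. It assumes the log lines have the same shape as the Flink 1.11.3 output shown above, and the function name `lost_node_hosts` and the two regular expressions are made up for this illustration.

```python
import re
from collections import Counter

# Assumed line shape (taken from the log above): YarnResourceManager reports
# the container it is closing because the node was lost.
RELEASED = re.compile(
    r"Closing TaskExecutor connection (?P<container>container_\S+) "
    r"because: Container released on a \*lost\* node"
)

# Assumed line shape: ExecutionGraph names the container and the host it ran on.
FAILED_ON = re.compile(
    r"switched from \w+ to FAILED on (?P<container>container_\S+) @ (?P<host>\S+)"
)


def lost_node_hosts(log_lines):
    """Count containers released on a lost node, grouped by host name.

    Containers whose host never appears in a FAILED line are counted
    under the key "unknown".
    """
    released = set()
    host_of = {}
    for line in log_lines:
        match = RELEASED.search(line)
        if match:
            released.add(match.group("container"))
        match = FAILED_ON.search(line)
        if match:
            host_of[match.group("container")] = match.group("host")
    return Counter(host_of.get(container, "unknown") for container in released)


if __name__ == "__main__":
    import sys

    # Usage: python lost_nodes.py <path to a JobManager log file>
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            for host, count in lost_node_hosts(handle).most_common():
                print(host, count)
```

Run against a saved copy of the JobManager log, this prints each host together with the number of containers that were released on it, which points to the machine whose NodeManager log should be inspected.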

Original article: https://outofmemory.cn/zaji/5700932.html
