A Flink job submitted to YARN fails with the following error after running for several days:
2022-01-05 15:09:26,288 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 89574 for job cc0abb4a3cd870b2a9e1abc7235ceb91 (3528 bytes in 610 ms).
2022-01-05 15:09:29,544 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@prod-bigdata-pc3:42636] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2022-01-05 15:09:30,678 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 89575 (type=CHECKPOINT) @ 1641366570678 for job cc0abb4a3cd870b2a9e1abc7235ceb91.
2022-01-05 15:09:30,729 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: prod-bigdata-pc3/10.5.2.133:42636
2022-01-05 15:09:30,729 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@prod-bigdata-pc3:42636] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@prod-bigdata-pc3:42636]] Caused by: [java.net.ConnectException: Connection refused: prod-bigdata-pc3/10.5.2.133:42636]
2022-01-05 15:09:31,482 INFO org.apache.flink.yarn.YarnResourceManager [] - Closing TaskExecutor connection container_e27_1640598151061_2774_01_000002 because: Container released on a *lost* node
2022-01-05 15:09:31,495 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: prod-bigdata-pc3/10.5.2.133:42636
2022-01-05 15:09:31,496 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@prod-bigdata-pc3:42636] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@prod-bigdata-pc3:42636]] Caused by: [java.net.ConnectException: Connection refused: prod-bigdata-pc3/10.5.2.133:42636]
2022-01-05 15:09:31,492 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> (Filter -> Map -> Sink: Unnamed, Filter -> Map -> Filter -> Map -> Sink: Unnamed) (1/1) (7d96c28b36c1b80514f188d59a885ca4) switched from RUNNING to FAILED on container_e27_1640598151061_2774_01_000002 @ prod-bigdata-pc3 (dataPort=44807).
java.lang.Exception: Container released on a *lost* node
    at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:370) ~[flink-dist_2.12-1.11.3.jar:1.11.3]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [akka-actor_2.11-2.5.21.jar:2.5.21]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [akka-actor_2.11-2.5.21.jar:2.5.21]
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [scala-library-2.11.12.jar:?]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [akka-actor_2.11-2.5.21.jar:2.5.21]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [scala-library-2.11.12.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [scala-library-2.11.12.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [scala-library-2.11.12.jar:?]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [akka-actor_2.11-2.5.21.jar:2.5.21]
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [akka-actor_2.11-2.5.21.jar:2.5.21]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [akka-actor_2.11-2.5.21.jar:2.5.21]
    at akka.actor.ActorCell.invoke(ActorCell.scala:561) [akka-actor_2.11-2.5.21.jar:2.5.21]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [akka-actor_2.11-2.5.21.jar:2.5.21]
    at akka.dispatch.Mailbox.run(Mailbox.scala:225) [akka-actor_2.11-2.5.21.jar:2.5.21]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [akka-actor_2.11-2.5.21.jar:2.5.21]
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [akka-actor_2.11-2.5.21.jar:2.5.21]
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [akka-actor_2.11-2.5.21.jar:2.5.21]
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [akka-actor_2.11-2.5.21.jar:2.5.21]
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [akka-actor_2.11-2.5.21.jar:2.5.21]
2022-01-05 15:09:31,533 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task 4081cf0163fcce7fe6af0cf07ad2d43c_0.
2022-01-05 15:09:31,539 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 1 tasks should be restarted to recover the failed task 4081cf0163fcce7fe6af0cf07ad2d43c_0.
2022-01-05 15:09:31,547 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job MainKafka2CK (cc0abb4a3cd870b2a9e1abc7235ceb91) switched from state RUNNING to RESTARTING.
2022-01-05 15:09:31,556 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 7d96c28b36c1b80514f188d59a885ca4.
2022-01-05 15:09:31,883 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> (Filter -> Map -> Sink: Unnamed, Filter -> Map -> Filter -> Map -> Sink: Unnamed) (1/1) (d556e4eb39bea674f6e10f51b009a535) switched from RUNNING to FAILED on container_e27_1640598151061_2774_01_000002 @ prod-bigdata-pc3 (dataPort=44807).
java.lang.Exception: Container released on a *lost* node
    at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:370) ~[flink-dist_2.12-1.11.3.jar:1.11.3]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154) ~[flink-runtime_2.11-1.11.3.jar:1.11.3]
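The most likely cause is that one of the NodeManager nodes ran short of resources. YARN marks a node as "lost" when the ResourceManager stops receiving heartbeats from it, and high CPU or disk usage on the node may be what made it unresponsive. In the log above, the affected node is prod-bigdata-pc3. To find the exact cause, check the NodeManager log on the node that was marked lost. A NodeManager that has been marked lost usually needs to be restarted before it recovers.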
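To decide which node's NodeManager log to inspect, it can help to pull the host and container ID out of the JobManager log automatically. The following is a minimal Python sketch, not a definitive tool: the function name `find_lost_nodes` and the 200-character look-ahead window are assumptions made for illustration, and the regular expression is based only on the log format shown in this post (Flink 1.11.3), so other versions may need a different pattern.

```python
import re

# Matches the ExecutionGraph entry that reports a task failing on a container and host.
# The pattern is derived from the log lines quoted in this post, not from any Flink API.
FAILED_ON = re.compile(
    r"switched from RUNNING to FAILED on (?P<container>container_\w+) @ (?P<host>\S+)"
)
LOST_MARKER = "Container released on a *lost* node"


def find_lost_nodes(log_text, window=200):
    """Return {host: set of container IDs} for tasks that failed on a lost node.

    `window` (an assumed value) is how many characters after the match are
    searched for the lost-node exception message.
    """
    lost = {}
    for match in FAILED_ON.finditer(log_text):
        tail = log_text[match.end(): match.end() + window]
        # Count the failure only if the lost-node exception follows the entry.
        if LOST_MARKER in tail:
            lost.setdefault(match.group("host"), set()).add(match.group("container"))
    return lost


# Sample taken from the log above; the operator chain name is shortened here.
sample = (
    "2022-01-05 15:09:31,492 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - "
    "Source: Custom Source (1/1) (7d96c28b36c1b80514f188d59a885ca4) switched from RUNNING to FAILED "
    "on container_e27_1640598151061_2774_01_000002 @ prod-bigdata-pc3 (dataPort=44807).\n"
    "java.lang.Exception: Container released on a *lost* node\n"
)

if __name__ == "__main__":
    for host, containers in find_lost_nodes(sample).items():
        print(host, sorted(containers))
```

Running it over the full JobManager log would list every host that lost containers, which narrows down where to look for the NodeManager log and which NodeManager may need a restart.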