- 背景
- Yarn 上面查看日志
Flink on yarn Cluster 模式运行一段时间后,程序突然报错, 查找Exceotion 发现 ”Container released on a *lost* node” 具体报错如下。
2021-12-24 09:43:43,931 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Filter -> Process -> (Sink: status, Sink: monitor, Sink: dtc, Sink: statusAll, Sink: other) (6/24) (1af69dcda39c4eb77cc0af944aa33f70) switched from RUNNING to FAILED. java.lang.Exception: Container released on a *lost* node at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:343) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at akka.actor.Actor.aroundReceive(Actor.scala:517) at akka.actor.Actor.aroundReceive$(Actor.scala:515) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) at akka.actor.ActorCell.invoke(ActorCell.scala:561) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) at akka.dispatch.Mailbox.run(Mailbox.scala:225) at akka.dispatch.Mailbox.exec(Mailbox.scala:235) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 2021-12-24 09:43:43,944 WARN org.apache.flink.yarn.YarnResourceManager - Discard registration from TaskExecutor container_e12_1621343092868_870015_01_000013 at (akka.tcp://flink@ip:42538/user/taskmanager_0) because the framework did not recognize it 2021-12-24 09:43:43,995 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - Calculating tasks to restart to recover the failed task 20ba6b65f97481d5570070de90e4e791_5. 2021-12-24 09:43:44,012 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - 2 tasks should be restarted to recover the failed task 20ba6b65f97481d5570070de90e4e791_5. 2021-12-24 09:43:44,033 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job t5_parse_classify (e1b87d6b42299c0fc45dc648aa0e20cc) switched from state RUNNING to RESTARTING. 2021-12-24 09:43:44,034 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (6/24) (542a4aa5761420c0bdf76cfe60a6d607) switched from RUNNING to CANCELING. 2021-12-24 09:43:44,037 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 1af69dcda39c4eb77cc0af944aa33f70. 2021-12-24 09:43:44,073 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (6/24) (542a4aa5761420c0bdf76cfe60a6d607) switched from CANCELING to CANCELED. 2021-12-24 09:43:44,073 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 542a4aa5761420c0bdf76cfe60a6d607. 2021-12-24 09:43:44,075 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 542a4aa5761420c0bdf76cfe60a6d607. 2021-12-24 09:43:44,080 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Filter -> Process -> (Sink: status, Sink: monitor, Sink: dtc, Sink: statusAll, Sink: other) (21/24) (0f92112ca6dc6e8b12d6375006a2269c) switched from RUNNING to FAILED. java.lang.Exception: Container released on a *lost* node at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:343) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at akka.actor.Actor.aroundReceive(Actor.scala:517) at akka.actor.Actor.aroundReceive$(Actor.scala:515) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) at akka.actor.ActorCell.invoke(ActorCell.scala:561) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) at akka.dispatch.Mailbox.run(Mailbox.scala:225) at akka.dispatch.Mailbox.exec(Mailbox.scala:235) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - Calculating tasks to restart to recover the failed task 20ba6b65f97481d5570070de90e4e791_20. 2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - 2 tasks should be restarted to recover the failed task 20ba6b65f97481d5570070de90e4e791_20. 2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (21/24) (812416954373cfd3b8e2ac0d6ab7609a) switched from RUNNING to CANCELING. 2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 0f92112ca6dc6e8b12d6375006a2269c. 2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (21/24) (812416954373cfd3b8e2ac0d6ab7609a) switched from CANCELING to CANCELED. 2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 812416954373cfd3b8e2ac0d6ab7609a. 2021-12-24 09:43:44,081 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 812416954373cfd3b8e2ac0d6ab7609a. 2021-12-24 09:43:44,082 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Filter -> Process -> (Sink: status, Sink: monitor, Sink: dtc, Sink: statusAll, Sink: other) (8/24) (493ea4995fe180b2dcc2965254973dd6) switched from RUNNING to FAILED. java.lang.Exception: Container released on a *lost* node at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:343) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at akka.actor.Actor.aroundReceive(Actor.scala:517) at akka.actor.Actor.aroundReceive$(Actor.scala:515) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) at akka.actor.ActorCell.invoke(ActorCell.scala:561) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) at akka.dispatch.Mailbox.run(Mailbox.scala:225) at akka.dispatch.Mailbox.exec(Mailbox.scala:235) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 2021-12-24 09:43:44,082 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - Calculating tasks to restart to recover the failed task 20ba6b65f97481d5570070de90e4e791_7. 2021-12-24 09:43:44,082 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - 2 tasks should be restarted to recover the failed task 20ba6b65f97481d5570070de90e4e791_7. 2021-12-24 09:43:44,083 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job t5_parse_classify (e1b87d6b42299c0fc45dc648aa0e20cc) switched from state RESTARTING to FAILING. org.apache.flink.runtime.JobException: Recovery is suppressed by FailureRateRestartBackoffTimeStrategy(FailureRateRestartBackoffTimeStrategy(failuresIntervalMS=300000,backoffTimeMS=10000,maxFailuresPerInterval=3) at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:110) at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:76) at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192) at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:186) at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:180) at org.apache.flink.runtime.scheduler.Schedulerbase.updateTaskExecutionState(Schedulerbase.java:484) at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49) at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgaboutInternalTaskFailure(ExecutionGraph.java:1703) at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1252) at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1220) at org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:955) at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:173) at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:165) at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:732) at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149) at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFailingAllocatedSlot(SlotPoolImpl.java:730) at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.failAllocation(SlotPoolImpl.java:710) at org.apache.flink.runtime.jobmaster.JobMaster.internalFailAllocation(JobMaster.java:541) at org.apache.flink.runtime.jobmaster.JobMaster.notifyAllocationFailure(JobMaster.java:667) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:274) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:194) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at akka.actor.Actor.aroundReceive(Actor.scala:517) at akka.actor.Actor.aroundReceive$(Actor.scala:515) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) at akka.actor.ActorCell.invoke(ActorCell.scala:561) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) at akka.dispatch.Mailbox.run(Mailbox.scala:225) at akka.dispatch.Mailbox.exec(Mailbox.scala:235) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.Exception: Container released on a *lost* node at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:343) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190) ... 22 moreYarn 上面查看日志
由上图可以粗略的推断出是因为某个节点丢失,造成的。
查看yarn的node,发现节点没有内存导致的(如下)。
找运维吧…
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)