Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: All datanodes DatanodeInfoWithStorage[10.21.131.179:50010,DS-6fca3fba-7b13-4855-b483-342df8432e2a,DISK] are bad. Aborting...
    at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:265)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1835)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: All datanodes DatanodeInfoWithStorage[10.21.131.179:50010,DS-6fca3fba-7b13-4855-b483-342df8432e2a,DISK] are bad. Aborting...
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:731)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
    at
This error shows up frequently on heavily loaded clusters, especially for large jobs: anything running longer than two hours fails with high probability. So what causes it?
The essence is actually simple: the task fails in the Reduce phase. A reducer has to pull map output from each node where a map task ran, downloading the data over HTTP from a designated directory on that node. If a download request fails, the disk on that host gets marked as bad, which is what produces the "All datanodes ... are bad. Aborting" message.
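Since the failure surfaces while reducers are fetching shuffle data, one common mitigation is to let reducers retry a failed fetch instead of immediately marking the host bad, giving a briefly offline node time to come back. A hedged sketch of session-level Hive settings (the property names come from Hadoop 2.x's shuffle-fetch retry feature, MAPREDUCE-5891; the values below are illustrative, so verify them against your cluster's version):

```
SET mapreduce.reduce.shuffle.fetch.retry.enabled=true;
SET mapreduce.reduce.shuffle.fetch.retry.interval-ms=1000;  -- wait 1s between retries
SET mapreduce.reduce.shuffle.fetch.retry.timeout-ms=30000;  -- give up after 30s total
```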
This situation usually means a host dropped offline briefly, detached from the cluster, and lost contact with the NM. CDH's stock monitoring will not catch it directly; the host looks perfectly healthy. You need to monitor each node's communication with the NM separately and raise an alert when that communication times out.
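Such a per-node check can be sketched in shell. This is a minimal illustration, not the author's actual monitor: it parses the output format of `yarn node -list -all` (a real YARN CLI command) and prints every node whose state is not RUNNING, so LOST or UNHEALTHY nodes can trigger an alert. The sample output embedded below is fabricated for the sketch; on a live cluster you would pipe in the command itself.

```shell
#!/bin/sh
# Flag nodes that are not in RUNNING state. The parser skips the two
# header lines that `yarn node -list -all` prints before the node rows.
parse_bad_nodes() {
  awk 'NR > 2 && $2 != "RUNNING" { print $1 }'
}

# Fabricated sample output standing in for: yarn node -list -all
sample_output="Total Nodes:3
      Node-Id  Node-State  Node-Http-Address  Number-of-Running-Containers
 host1:45454   RUNNING     host1:8042         3
 host2:45454   LOST        host2:8042         0
 host3:45454   UNHEALTHY   host3:8042         0"

bad=$(printf '%s\n' "$sample_output" | parse_bad_nodes)

if [ -n "$bad" ]; then
  # In a real monitor this would send the alert mail instead of echoing.
  echo "$bad"
fi
```

A cron job running this every minute against the live command, mailing the output whenever it is non-empty, is enough to catch the brief NM disconnections described above.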
I checked the monitoring alert emails, and sure enough, that was exactly the case.