Flink 1.14.4 Standalone Cluster Issues


Official configuration reference: Configuration | Apache Flink

1. The TaskManager (TM) process stops after running for a while

Error message: org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Task did not exit gracefully within 180 + seconds.
org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
        at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1791) [flink-dist_2.11-1.14.4.jar:1.14.4]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]

Cause: task cancellation timed out, so the cancellation watchdog killed the TaskManager.

Solution: edit the TM configuration file ${FLINK_HOME}/conf/flink-conf.yaml

# Disable the task-cancellation watchdog

task.cancellation.timeout: 0

Parameter description (from the official docs): Timeout in milliseconds after which a task cancellation times out and leads to a fatal TaskManager error. A value of 0 deactivates the watch dog. Notice that a task cancellation is different from both a task failure and a clean shutdown. Task cancellation timeout only applies to task cancellation and does not apply to task closing/clean-up caused by a task failure or a clean shutdown.
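The one-line change above can also be scripted if you manage several TMs. A minimal sketch, assuming a flink-conf.yaml-style "key: value" file (the helper name and path handling are illustrative, not part of Flink):

```python
def set_conf_key(path, key, value):
    """Idempotently set `key: value` in a flink-conf.yaml-style file:
    replace an existing line for the key, or append one if missing."""
    try:
        with open(path) as f:
            lines = f.readlines()
    except FileNotFoundError:
        lines = []
    prefix = key + ":"
    out, replaced = [], False
    for line in lines:
        if line.strip().startswith(prefix):
            out.append(f"{key}: {value}\n")
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(f"{key}: {value}\n")
    with open(path, "w") as f:
        f.writelines(out)
```

Usage would be, e.g., `set_conf_key("conf/flink-conf.yaml", "task.cancellation.timeout", "0")`, leaving all other keys untouched.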

2. JARs uploaded through the web UI are all lost after the standalone cluster restarts

Cause: by default the files are saved under /tmp, which the OS periodically cleans up.

Solution: edit the JM configuration file ${FLINK_HOME}/conf/flink-conf.yaml

web.upload.dir: /usr/local/flink/upload
web.tmpdir: /usr/local/flink/tmpdir
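Both directories must exist and be writable by the user running the JobManager before the cluster starts. A small pre-flight check can be sketched like this (the function is illustrative; the paths are the ones from the config above):

```python
import os

def ensure_writable_dirs(paths):
    """Create each directory if missing and return any that are
    not writable by the current user."""
    problems = []
    for p in paths:
        os.makedirs(p, exist_ok=True)
        if not os.access(p, os.W_OK):
            problems.append(p)
    return problems

# e.g. before start-cluster.sh:
# ensure_writable_dirs(["/usr/local/flink/upload", "/usr/local/flink/tmpdir"])
```

If the returned list is non-empty, fix ownership/permissions before starting the cluster.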

3. stop-cluster.sh on the JM cannot stop the standalone cluster

Cause: the pid files are saved under /tmp by default; once they are cleaned up, the script can no longer find the pids of the processes it needs to kill.

Solution: edit the JM configuration file ${FLINK_HOME}/conf/flink-conf.yaml

env.pid.dir: /usr/local/flink/piddir
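The daemon scripts effectively read pids back from those files and signal the processes; when /tmp has been cleaned, there is nothing to read. A sketch of that check, to illustrate the failure mode (the function is illustrative, not Flink code):

```python
import os

def alive_pids(pid_file):
    """Return the pids from a Flink-style pid file (one pid per line)
    that still refer to live processes. A missing or empty file means
    stop-cluster.sh has nothing to stop."""
    if not os.path.exists(pid_file):
        return []  # /tmp was cleaned: the script silently finds no pids
    alive = []
    with open(pid_file) as f:
        for line in f:
            pid = int(line.strip())
            try:
                os.kill(pid, 0)  # signal 0: existence check, sends nothing
                alive.append(pid)
            except OSError:
                pass  # stale pid: process already gone
    return alive
```

With `env.pid.dir` pointed at a persistent directory, the pid files survive and the stop script works as expected.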

4. A value stored in ZooKeeper was too long; the ZooKeeper ensemble went down, which in turn brought down all the TMs. ZooKeeper error message:

Unexpected exception causing shutdown while sock still open
java.io.IOException: Unreasonable length = 1970218037

    at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:95)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:249)


A matching report from the ZooKeeper community: "Zookeeper server went down in HA cluster. Please reply if there is any workaround."

Suggested answer: "You can attempt to increase your jute.maxbuffer Java System Property on the ZK servers to a value higher than 2-3 GB (in bytes) to overcome this. It appears a very large record was somehow placed into your ZK by an application, which appears to have then caused this issue."

Solution: raise ZooKeeper's jute.maxbuffer parameter to an appropriate size.
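jute.maxbuffer is specified in bytes (the default is 0xfffff, just under 1 MiB) and is typically passed as a JVM system property, e.g. via `-Djute.maxbuffer=...` in the ZooKeeper server's JVM flags. A tiny sketch for producing the flag from a size in MiB (the 64 MiB figure is an illustrative choice, not a tuned recommendation):

```python
def jute_maxbuffer_flag(mib):
    """Build the JVM flag for a jute.maxbuffer of `mib` mebibytes."""
    return f"-Djute.maxbuffer={mib * 1024 * 1024}"

# e.g. jute_maxbuffer_flag(64) -> "-Djute.maxbuffer=67108864"
```

Note that the ZooKeeper documentation warns this setting should be changed consistently on servers and clients, since a mismatch can itself cause read failures.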

5. java.lang.OutOfMemoryError: Metaspace. Full error message:

java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
        at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_291]
        at java.lang.ClassLoader.defineClass(ClassLoader.java:756) ~[?:1.8.0_291]
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) ~[?:1.8.0_291]
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_291]
        at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_291]
        at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_291]
        at java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_291]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_291]
        at java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_291]

Cause: no definitive root cause found yet; still under observation. Online reports suggest two possibilities: blocking code and backpressure.

Short-term workaround: edit the TM configuration file ${FLINK_HOME}/conf/flink-conf.yaml

Increase the setting (the default is 256m):

taskmanager.memory.jvm-metaspace.size: 512m
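A minimal flink-conf.yaml fragment for this workaround might look as follows; the 512m value is simply double the default rather than a tuned figure, and the JobManager-side option shown alongside is an assumption based on the analogous `jobmanager.memory.jvm-metaspace.size` key in the official configuration docs:

```yaml
# TaskManager JVM metaspace (default 256m); raise if Metaspace OOMs recur
taskmanager.memory.jvm-metaspace.size: 512m
# Optional: the JobManager has an analogous setting (also defaults to 256m)
jobmanager.memory.jvm-metaspace.size: 512m
```

If the error keeps returning after repeated job (re)submissions, the error message itself points at a class-loading leak in user code or its dependencies, which raising metaspace only postpones.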

Feel free to share; when reposting please credit the source: outofmemory.cn

Original article: http://outofmemory.cn/langs/920012.html
