A Record of a Time-Consuming Docker Error Investigation


This article walks through a time-consuming Docker error investigation, with the full command output included along the way. It may be a useful reference for your study or work.

Background

This happened on a customer's deeply customized system, modified from CentOS. After half a day of investigation, the cause turned out to be a corrupted file, not a Docker problem.

Environment

Docker info:

$ docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 18.09.3
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e
runc version: 6635b4f0c6af3810594d2770f662f34ddc15b40d
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 20
Total Memory: 125.3GiB
Name: eds-1f21a854
ID: VZLV:26PU:ZUN6:QQEJ:GW3I:YETT:HMEU:CK6J:SIPL:CHKV:6ASN:5NDF
Docker Root Dir: /data/kube/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 reg.wps.lan:5000
 treg.yun.wps.cn
 127.0.0.0/8
Registry Mirrors:
 https://registry.docker-cn.com/
 https://docker.mirrors.ustc.edu.cn/
Live Restore Enabled: false
Product License: Community Engine

System info:

$ uname -a
Linux eds-1f21a854 3.10.0 #1 SMP Mon Sep 28 12:00:30 CST 2020 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Server info:

$ cat product_name
SUGON-60G16
$ cat sys_vendor
SANGFOR
$ cat product_version
1.2

Troubleshooting process

Server hang during Docker installation

Time: 2020-10-29 19:51:

Implementation engineer: While deploying at the customer site, the server hung during Docker installation.
Me: After it comes back up, is there anything in /var/log/messages?
Implementation engineer: I can only get in by restoring a snapshot; the server is unreachable, so I can't read the logs.

At this point I suspected a kernel bug had triggered a kernel panic, leaving the server unable to boot.

2020-10-30 09:07:

Me: If it won't boot, go look at the console and see why.
Implementation engineer: It's the customer's server; we can't check it.
Me: Hasn't the customer checked it?

Then I just sent over a Sunflower (向日葵) remote-connection link. Once connected, I found it was not a stock OS but one modified from CentOS. I couldn't find /var/log/messages, so I manually ran our Docker installation script:

bash -x install-docker.sh

The output stalled at a certain step with nothing further printed, so it looked "hung". Reading the last debug output against the script, the next step was starting Docker, so starting Docker had presumably triggered the hang. After a long while the machine still couldn't be pinged. I asked the implementation engineer to check with the site whether the server had iDRAC, iLO, or the like, and if it was a physical server, to look at the tty console output.

The on-site staff reported the server had "booted normally", but I tried and still couldn't connect. They asked whether our operation had changed the routing; on site, systemctl showed docker was up. We still couldn't reach the gateway. At this point it suddenly occurred to me: maybe the machine had never hung at all...

I asked them to check the last boot time, and indeed the machine had never rebooted...
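Checking the last boot time is cheap and would have settled the "did it actually hang?" question immediately. A minimal sketch (assumes a Linux /proc and GNU date; `who -b` or `last reboot` read the same answer from utmp/wtmp):

```shell
# Boot time = current time minus uptime; the first field of /proc/uptime
# is seconds since boot, so no wtmp/utmp records are needed.
boot_ts=$(awk -v now="$(date +%s)" '{ printf "%d", now - $1 }' /proc/uptime)
date -d "@$boot_ts" '+last boot: %F %T'
```

If the printed boot time predates the "hang", the machine never went down, which is exactly what turned out to be the case here.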

Then the site discovered it was an iptables problem: when Docker started, their firewall rules got wiped out. Once they restored the rules, traffic flowed again. So the earlier "hang" during Docker startup was really a network outage caused by iptables; the machine had never rebooted at all.
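For sites that manage their own firewall rules, one mitigation is to tell dockerd not to manipulate iptables at all via /etc/docker/daemon.json (a sketch, not a universal fix; it shifts responsibility for container NAT and forwarding rules onto the site's own rule set):

```json
{
  "iptables": false
}
```

With this set and the daemon restarted, Docker's bridge networking needs manually maintained MASQUERADE/FORWARD rules, so it is a trade-off; the default behavior of inserting rules at daemon start is exactly what clobbered the customer's rules here.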

Error loading an image

Then the implementation engineer said that an identical machine earlier had installed Docker without any of the above problems, but when I manually ran the next deployment step, it errored out. bash -x debugging showed the error was raised while loading the deploy image:

error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.39/images/load?quiet=0: read unix @->/var/run/docker.sock: read: EOF

Running it manually:

$ docker load -i ./kube/images/deploy.tar
error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.39/images/load?quiet=0: read unix @->/var/run/docker.sock: read: connection reset by peer

journalctl showed no relevant docker daemon logs at all. Searching for this error, some people said the docker group for /var/run/docker.sock didn't exist; others fixed it with chmod 777. I tried both; neither worked. Next I ran dockerd in the foreground in debug mode to look for anything useful:

systemctl stop docker
pkill dockerd
dockerd -D

Then, in another terminal, load the image again:

$ docker load -i ./kube/images/deploy.tar
ab6425526dab: Loading layer [==================================================>]  126.3MB/126.3MB
c7fe3ea715ef: Loading layer [==================================================>]  340.3MB/340.3MB
7f7eae7934f7: Loading layer [==================================================>]  3.584kB/3.584kB
e99a66706b50: Loading layer [==================================================>]  2.105MB/2.105MB
245b79de6bb7: Loading layer [==================================================>]  283.5MB/283.5MB
2a56a4432cef: Loading layer [==================================================>]  93.18kB/93.18kB
0c2ea71066ab: Loading layer [==================================================>]  276.5kB/276.5kB
bb3f6df0f87c: Loading layer [==================================================>]  82.94kB/82.94kB
6f34ead3cef7: Loading layer [==================================================>]  946.7kB/946.7kB
c21422dd15f6: Loading layer [==================================================>]  1.97MB/1.97MB
940288517f4c: Loading layer [==================================================>]  458.2kB/458.2kB
0c52f1af61f4: Loading layer [==================================================>]  5.12kB/5.12kB
049e7edd48bc: Loading layer [==================================================>]  1.57MB/1.57MB
73307245d702: Loading layer [==================================================>]  5.632kB/5.632kB
f109309d8ffc: Loading layer [==================================================>]  2.175MB/2.175MB
Loaded image: xxxxxxxxxxxx.cn/platform/deploy-amd64:ubuntu.16.04
$ docker images
REPOSITORY                              TAG            IMAGE ID       CREATED        SIZE
xxxxxxxxxxxx.cn/platform/deploy-amd64   ubuntu.16.04   3ad94a76af5d   3 months ago   734MB

Debugging to this point, the foreground log output looked perfectly normal:

...
DEBU[2020-10-30T14:48:39.955963986+08:00] Applied tar sha256:049e7edd48bc46e3dd5edf89c9caa8f0f7efbb41af403c5a54dd4f1008f604a7 to d58edd0d97bb672ef40e82e45c1603ca3ceaad847d9b9fc7c9b0588087019649, size: 1518278
DEBU[2020-10-30T14:48:39.960091040+08:00] Applying tar in /data/kube/docker/overlay2/b044bd592ae800ed071208c6b2f650c5cbdc7452702f56a23b9b4ffe4236ac18/diff  storage-driver=overlay2
DEBU[2020-10-30T14:48:40.059510528+08:00] Applied tar sha256:73307245d7021f9627ca0b2cbfeab3aac0b65abfd476f6ec26bb92c75892d7e2 to b044bd592ae800ed071208c6b2f650c5cbdc7452702f56a23b9b4ffe4236ac18, size: 3284
DEBU[2020-10-30T14:48:40.063040538+08:00] Applying tar in /data/kube/docker/overlay2/03918b1d275aa284532b8b9c59ca158409416f904e13cc7084c598ed343e844f/diff  storage-driver=overlay2
DEBU[2020-10-30T14:48:40.148209852+08:00] Applied tar sha256:f109309d8ffcb76589ad6389e80335d986b411c80122d990ab00a02a3a916e3e to 03918b1d275aa284532b8b9c59ca158409416f904e13cc7084c598ed343e844f, size: 2072803
^CINFO[2020-10-30T14:48:55.593523177+08:00] Processing signal 'interrupt'
DEBU[2020-10-30T14:48:55.593617229+08:00] daemon configured with a 15 seconds minimum shutdown timeout
DEBU[2020-10-30T14:48:55.593638628+08:00] start clean shutdown of all containers with a 15 seconds timeout...
DEBU[2020-10-30T14:48:55.594074457+08:00] Unix socket /run/docker/libnetwork/ebd15186e86385c48c4c5508d5f30eb83d5d74e56f09af5c82b6d6d9d63ec8b8.sock doesn't exist. cannot accept client connections
DEBU[2020-10-30T14:48:55.594106623+08:00] Cleaning up old mountid: start.
INFO[2020-10-30T14:48:55.594157536+08:00] stopping event stream following graceful shutdown  error="<nil>" module=libcontainerd namespace=moby
DEBU[2020-10-30T14:48:55.594343122+08:00] Cleaning up old mountid: done.
DEBU[2020-10-30T14:48:55.594501828+08:00] Clean shutdown succeeded
INFO[2020-10-30T14:48:55.594520918+08:00] stopping healthcheck following graceful shutdown  module=libcontainerd
INFO[2020-10-30T14:48:55.594531978+08:00] stopping event stream following graceful shutdown  error="context canceled" module=libcontainerd namespace=plugins.moby
DEBU[2020-10-30T14:48:55.594603119+08:00] received signal  signal=terminated
INFO[2020-10-30T14:48:55.594739890+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc4201a61b0, TRANSIENT_FAILURE  module=grpc
INFO[2020-10-30T14:48:55.594751465+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc4201a61b0, CONNECTING  module=grpc

After going over the systemd configuration I was stuck. I had no idea why the import worked when dockerd ran in the foreground, and I couldn't think of anything else to check, so I suspected the socket and tried forwarding it over TCP with socat. Still no luck. (What I should have tried here was making the daemon itself listen on TCP 127.0.0.1 rather than going through the unix socket at all; socat still funnels everything through the socket in the end.)

$ socat -d -d TCP-LISTEN:2375,fork,bind=127.0.0.1 UNIX:/var/run/docker.sock
2020/10/30 17:39:58 socat[5201] N listening on AF=2 127.0.0.1:2375
2020/10/30 17:42:06 socat[5201] N accepting connection from AF=2 127.0.0.1:35370 on AF=2 127.0.0.1:2375
2020/10/30 17:42:06 socat[5201] N forked off child process 11501
2020/10/30 17:42:06 socat[5201] N listening on AF=2 127.0.0.1:2375
2020/10/30 17:42:06 socat[11501] N opening connection to AF=1 "/var/run/docker.sock"
2020/10/30 17:42:06 socat[11501] N successfully connected from local address AF=1 "\0\0\0\0\0\0\x0D\x@7\xE9\xEC\x7E\0\0\0\x01\0\0\0\0"
2020/10/30 17:42:06 socat[11501] N starting data transfer loop with FDs [6,6] and [5,5]
2020/10/30 17:42:12 socat[11501] E write(5, 0x55f098349920, 8192): Broken pipe
2020/10/30 17:42:12 socat[11501] N exit(1)
2020/10/30 17:42:12 socat[5201] N child died (): handling signal 17
$ docker --log-level debug -H tcp://127.0.0.1:2375 load -i kube/images/deploy.tar
c7fe3ea715ef: Loading layer [==========================================>]  286.9MB/340.3MB
unexpected EOF

This dragged on for a long time. I was busy at that point and went off to look at another customer's problem. When I came back to it, on a whim I tried loading the other images, and they all worked fine...

$ docker load -i kube/images/pause_3.1.tar
e17133b79956: Loading layer [==================================================>]  744.4kB/744.4kB
Loaded image: mirrorgooglecontainers/pause-amd64:3.1
$ docker load -i kube/images/tiller_v2.16.1.tar
77cae8ab23bf: Loading layer [==================================================>]  5.815MB/5.815MB
679105aa33fb: Loading layer [==================================================>]  6.184MB/6.184MB
639eab5d05b1: Loading layer [==================================================>]  40.46MB/40.46MB
87e5687e03f2: Loading layer [==================================================>]  41.13MB/41.13MB
Loaded image: gcr.io/kubernetes-helm/tiller:v2.16.1
$ docker load -i kube/images/calico_v3.1.3.tar
cd7100a72410: Loading layer [==================================================>]  4.403MB/4.403MB
ddc4cb8dae60: Loading layer [==================================================>]  7.84MB/7.84MB
77087b8943a2: Loading layer [==================================================>]  249.3kB/249.3kB
c7227c83afaf: Loading layer [==================================================>]  4.801MB/4.801MB
2e0e333a66b6: Loading layer [==================================================>]  231.8MB/231.8MB
Loaded image: calico/node:v3.1.3
2580685bfb60: Loading layer [==================================================>]  50.84MB/50.84MB
Loaded image: calico/kube-controllers:v3.1.3
0314be9edf00: Loading layer [==================================================>]  1.36MB/1.36MB
15db169413e5: Loading layer [==================================================>]  28.05MB/28.05MB
4252efcc5013: Loading layer [==================================================>]  2.818MB/2.818MB
76cf2496cf36: Loading layer [==================================================>]  3.03MB/3.03MB
91d3d3a16862: Loading layer [==================================================>]  2.995MB/2.995MB
18a58488ba3b: Loading layer [==================================================>]  3.474MB/3.474MB
8d8197f49da2: Loading layer [==================================================>]  27.34MB/27.34MB
7520364e0845: Loading layer [==================================================>]  9.216kB/9.216kB
b9d064622bd6: Loading layer [==================================================>]  2.56kB/2.56kB
Loaded image: calico/cni:v3.1.3

Only this one image errors on import:

$ docker load -i kube/images/deploy.tar
error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.39/images/load?quiet=0: read unix @->/var/run/docker.sock: read: connection reset by peer

Then I compared this file's checksum against the copy on the machine it was shipped from, and sure enough, it didn't match....
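In hindsight, a checksum comparison between the shipping machine and the target would have been the first thing to try. A sketch with hypothetical paths under /tmp, simulating a single corrupted byte in transit:

```shell
# Create a 'source' archive and a 'transferred' copy with one extra byte,
# then compare SHA-256 digests the way you would across two machines.
src=/tmp/deploy_src.tar
dst=/tmp/deploy_dst.tar
printf 'image-bytes' > "$src"
cp "$src" "$dst"
printf 'X' >> "$dst"                       # simulate transfer corruption
a=$(sha256sum "$src" | cut -d' ' -f1)
b=$(sha256sum "$dst" | cut -d' ' -f1)
[ "$a" = "$b" ] && echo "checksums match" || echo "checksums differ"
# → checksums differ
```

For a real transfer, run `sha256sum deploy.tar` on both ends and compare, or ship a `.sha256` file next to the artifact so the check is mechanical.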

Summary

Two questions remain open. First, why did the load succeed when dockerd ran in the foreground? Second, when importing a corrupted file, the docker daemon simply resets the connection without writing any log at all; I haven't tested whether newer versions behave the same way.

That's all for this record of a time-consuming Docker error investigation. For more on Docker troubleshooting, search our earlier articles or continue browsing the related articles below. We hope you'll keep supporting us!

Source: 内存溢出 — http://outofmemory.cn/zz/774215.html
