# cat /proc/drbd
version: 8.3.13 (api:88/proto:86-96)
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by mockbuild@builder10.centos.org, 2012-05-07 11:56:36
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:81 nr:407832 dw:106657970 dr:266340 al:179 bm:6551 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
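(For context: OCFS2 on DRBD needs dual-primary mode, which is what ro:Primary/Primary above shows. A minimal sketch of the relevant DRBD 8.3 resource option; the resource name r0 is hypothetical, not taken from this setup:)

resource r0 {
    net {
        # both nodes must be allowed to be Primary at once,
        # so that each can mount the OCFS2 volume simultaneously
        allow-two-primaries;
    }
}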
Everything works until I try to mount the volume:
# mount -t ocfs2 /dev/drbd1 /data/webroot/
mount.ocfs2: Transport endpoint is not connected while mounting /dev/drbd1 on /data/webroot/. Check 'dmesg' for more information on this error.
/var/log/kern.log (on node 1):
kernel: (o2net,11427,1):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
kernel: (mount.ocfs2,12037,1):dlm_request_join:1036 ERROR: status = -107
kernel: (mount.ocfs2,1):dlm_try_to_join_domain:1210 ERROR: status = -107
kernel: (mount.ocfs2,1):dlm_join_domain:1488 ERROR: status = -107
kernel: (mount.ocfs2,1):dlm_register_domain:1754 ERROR: status = -107
kernel: (mount.ocfs2,1):ocfs2_dlm_init:2808 ERROR: status = -107
kernel: (mount.ocfs2,1):ocfs2_mount_volume:1447 ERROR: status = -107
kernel: ocfs2: Unmounting device (147,1) on (node 1)
Status -107 is -ENOTCONN, i.e. "Transport endpoint is not connected", the same error mount printed. And here's the kernel log on node 0 (192.168.3.145):
kernel: : (swapper,7):o2net_listen_data_ready:1894 bytes: 0
kernel: : (o2net,4024,3):o2net_accept_one:1800 attempt to connect from unknown node at 192.168.2.93:43868
kernel: : (o2net,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
kernel: : (o2net,3):o2net_set_nn_state:478 node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0 -> -107
I'm sure that /etc/ocfs2/cluster.conf is identical on both nodes:
/etc/ocfs2/cluster.conf
node:
    ip_port = 7777
    ip_address = 192.168.3.145
    number = 0
    name = SVR233NTC-3145.localdomain
    cluster = cpc

node:
    ip_port = 7777
    ip_address = 192.168.2.93
    number = 1
    name = SVR022-293.localdomain
    cluster = cpc

cluster:
    node_count = 2
    name = cpc
And they can connect to each other just fine:
# nc -z 192.168.3.145 7777
Connection to 192.168.3.145 7777 port [tcp/cbt] succeeded!
but the O2CB heartbeat is not active on the new node (192.168.2.93):
# /etc/init.d/o2cb status
Driver for "configfs": Loaded@R_419_6852@system "configfs": MountedDriver for "ocfs2_dlmfs": Loaded@R_419_6852@system "ocfs2_dlmfs": MountedChecking O2CB cluster cpc: OnlineHeartbeat dead threshold = 31 Network IDle timeout: 30000 Network keepalive delay: 2000 Network reconnect delay: 2000Checking O2CB heartbeat: Not active
Here's the result of running tcpdump on node 1 while starting OCFS2 on node 1:
1 0.000000 192.168.2.93 -> 192.168.3.145 TCP 70 55274 > cbt [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSval=690432180 TSecr=0
2 0.000008 192.168.3.145 -> 192.168.2.93 TCP 70 cbt > 55274 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSval=707657223 TSecr=690432180
3 0.000223 192.168.2.93 -> 192.168.3.145 TCP 66 55274 > cbt [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSval=690432181 TSecr=707657223
4 0.000286 192.168.2.93 -> 192.168.3.145 TCP 98 55274 > cbt [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=32 TSval=690432181 TSecr=707657223
5 0.000292 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
6 0.000324 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [RST, ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
The RST flag is sent on every sixth packet: the TCP handshake completes, node 1 pushes its 32-byte o2net handshake (packet 4, Len=32), and node 0 immediately resets the connection. That matches the "attempt to connect from unknown node" error in node 0's kernel log above.
What else can I do to debug this?
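(One more check that would have helped here, as it turns out below: compare what the o2nm node manager actually has registered in configfs on each node, since it can drift out of sync with /etc/ocfs2/cluster.conf. A sketch, assuming the standard o2nm attribute names:)

# dump the kernel's view of the cluster membership on each node
for n in /sys/kernel/config/cluster/cpc/node/*; do
    echo "$(basename $n): num=$(cat $n/num) ip=$(cat $n/ipv4_address):$(cat $n/ipv4_port) local=$(cat $n/local)"
done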
PS:
OCFS2 versions on node 0:
> ocfs2-tools-1.4.4-1.el5
> ocfs2-2.6.18-274.12.1.el5-1.4.7-1.el5
OCFS2 versions on node 1:
> ocfs2-tools-1.4.4-1.el5
> ocfs2-2.6.18-308.el5-1.4.7-1.el5
Update 1 – Sun Dec 23 18:15:07 ICT 2012
Are both nodes on the same lan segment? No routers etc.?
No, they are two VMware servers on different subnets.
Oh, while I remember – hostnames/DNS all set up and working correctly?
Sure, I added the hostname and IP address of each node to /etc/hosts:
192.168.2.93    SVR022-293.localdomain
192.168.3.145   SVR233NTC-3145.localdomain
and they can connect to each other via hostname:
# nc -z SVR022-293.localdomain 7777
Connection to SVR022-293.localdomain 7777 port [tcp/cbt] succeeded!

# nc -z SVR233NTC-3145.localdomain 7777
Connection to SVR233NTC-3145.localdomain 7777 port [tcp/cbt] succeeded!
Update 2 – Mon Dec 24 18:32:15 ICT 2012
Found a clue: my colleague manually edited the /etc/ocfs2/cluster.conf file while the cluster was running. So the dead node information is still kept in /sys/kernel/config/cluster/:
# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR150-4107.localdomain
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR233NTC-3145.localdomain
(SVR150-4107.localdomain in this case.)
I wanted to stop the cluster to remove the dead node, but got the following error:
# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active
I'm sure the ocfs2 service has already been stopped:
# mounted.ocfs2 -f
Device      FS     Nodes
/dev/sdb    ocfs2  Not mounted
/dev/drbd1  ocfs2  Not mounted
There are no references left:
# ocfs2_hb_ctl -I -u 12963EAF4E16484DB81ECB0251177C26
12963EAF4E16484DB81ECB0251177C26: 0 refs
I also unloaded the ocfs2 kernel module just to be sure:
# ps -ef | grep [o]cfs2
root     12513    43  0 18:25 ?        00:00:00 [ocfs2_wq]
# modprobe -r ocfs2
# ps -ef | grep [o]cfs2
# lsof | grep ocfs2
but nothing changed:
# /etc/init.d/o2cb offline
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active
So the final question is: how do I remove the dead node information without rebooting?
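(In principle this should not need a reboot: o2cb creates node objects in configfs with mkdir, so a stale node that nothing references should be removable with a plain rmdir. A sketch; it could not be verified here because the stuck heartbeat region below blocked everything:)

# remove the stale node object from the live configfs tree
rmdir /sys/kernel/config/cluster/cpc/node/SVR150-4107.localdomain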
Update 3 – Mon Dec 24 22:41:51 ICT 2012
Here are all the running heartbeat threads:
# ls -l /sys/kernel/config/cluster/cpc/heartbeat/ | grep '^d'
drwxr-xr-x 2 root root 0 Dec 24 22:18 72EF09EA3D0D4F51BDC00B47432B1EB2
The reference count of this heartbeat region:
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 7 refs
Trying to kill it:
# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
Any ideas?
Solution: Oh yeah! Problem solved. Pay attention to the UUIDs:
# mounted.ocfs2 -d
Device      FS     Stack  UUID                              Label
/dev/sdb    ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
/dev/drbd1  ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
but:
# ls -l /sys/kernel/config/cluster/cpc/heartbeat/
drwxr-xr-x 2 root root 0 Dec 24 22:53 72EF09EA3D0D4F51BDC00B47432B1EB2
This probably happened because I "accidentally" force-reformatted the OCFS2 volume. The problem I'm facing is similar to this one on the Ocfs2-users mailing list.
It is also the cause of the following error:
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
because ocfs2_hb_ctl cannot find a device with UUID 72EF09EA3D0D4F51BDC00B47432B1EB2 in /proc/partitions.
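(That failing lookup can be reproduced by hand: scan every device listed in /proc/partitions and read the OCFS2 superblock UUID from each. A sketch, assuming tunefs.ocfs2's -Q query flag as used below:)

# approximate the device-by-UUID lookup that ocfs2_hb_ctl performs
for dev in $(awk 'NR > 2 {print "/dev/" $4}' /proc/partitions); do
    uuid=$(tunefs.ocfs2 -Q "%U\n" "$dev" 2>/dev/null)
    [ "$uuid" = "72EF09EA3D0D4F51BDC00B47432B1EB2" ] && echo "$dev"
done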
Then an idea came to my mind: can I change the UUID of an OCFS2 volume?
Looking through the tunefs.ocfs2 man page:
Usage: tunefs.ocfs2 [options] <device> [new-size]
       tunefs.ocfs2 -h|--help
       tunefs.ocfs2 -V|--version
[options] can be any mix of:
        -U|--uuid-reset[=new-uuid]
So I executed the following command, passing an explicit UUID so the volume matches the stale heartbeat region instead of letting --uuid-reset generate a random one:
# tunefs.ocfs2 --uuid-reset=72EF09EA3D0D4F51BDC00B47432B1EB2 /dev/drbd1
WARNING!!! OCFS2 uses the UUID to uniquely identify a file system. Having two OCFS2 file systems with the same UUID could, in the least, cause erratic behavior, and if unlucky, cause file system damage. Please choose the UUID with care.
Update the UUID ? yes
Verify:
# tunefs.ocfs2 -Q "%U\n" /dev/drbd1
72EF09EA3D0D4F51BDC00B47432B1EB2
Trying to kill the heartbeat region again to see what happens:
# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 6 refs
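(Each -K call drops exactly one reference here, so the release cycle can be scripted; a minimal sketch, relying on the "<UUID>: N refs" output format shown above:)

UUID=72EF09EA3D0D4F51BDC00B47432B1EB2
# keep releasing references until the heartbeat region has none left
while [ "$(ocfs2_hb_ctl -I -u "$UUID" | awk '{print $2}')" -gt 0 ]; do
    ocfs2_hb_ctl -K -u "$UUID"
done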
I kept killing it until I saw 0 refs, then took the cluster offline:
# /etc/init.d/o2cb offline cpc
Stopping O2CB cluster cpc: OK
and stopped it:
# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: OK
Unloading module "ocfs2": OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK
Started it again to see whether the node information had been updated:
# /etc/init.d/o2cb start
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster cpc: OK

# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR022-293.localdomain
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR233NTC-3145.localdomain
OK, on the peer node (192.168.2.93), try to start OCFS2:
# /etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2)                [  OK  ]
Thanks to Sunil Mushran, because this thread helped me solve the problem.
The lessons are:
> IP addresses, ports, … can only be changed when the cluster is offline. See the FAQ.
> Never force-reformat an OCFS2 volume.