# cat /proc/drbd
version: 8.3.13 (api:88/proto:86-96)
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by mockbuild@builder10.centos.org, 2012-05-07 11:56:36
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:81 nr:407832 dw:106657970 dr:266340 al:179 bm:6551 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
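(For context: OCFS2 on DRBD needs dual-primary mode, which is what ro:Primary/Primary above shows. A minimal sketch of the relevant DRBD 8.3 resource option; the resource name r0 is hypothetical, not taken from this setup:)

resource r0 {
    net {
        # both nodes must be allowed to be Primary at once,
        # so that each can mount the OCFS2 volume simultaneously
        allow-two-primaries;
    }
}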
Everything works until I try to mount the volume:
# mount -t ocfs2 /dev/drbd1 /data/webroot/
mount.ocfs2: Transport endpoint is not connected while mounting /dev/drbd1 on /data/webroot/. Check 'dmesg' for more information on this error.
/var/log/kern.log (on node 1):
kernel: (o2net,11427,1):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
kernel: (mount.ocfs2,12037,1):dlm_request_join:1036 ERROR: status = -107
kernel: (mount.ocfs2,1):dlm_try_to_join_domain:1210 ERROR: status = -107
kernel: (mount.ocfs2,1):dlm_join_domain:1488 ERROR: status = -107
kernel: (mount.ocfs2,1):dlm_register_domain:1754 ERROR: status = -107
kernel: (mount.ocfs2,1):ocfs2_dlm_init:2808 ERROR: status = -107
kernel: (mount.ocfs2,1):ocfs2_mount_volume:1447 ERROR: status = -107
kernel: ocfs2: Unmounting device (147,1) on (node 1)
Status -107 is -ENOTCONN, i.e. "Transport endpoint is not connected", the same error mount printed. And here's the kernel log on node 0 (192.168.3.145):
kernel: : (swapper,7):o2net_listen_data_ready:1894 bytes: 0
kernel: : (o2net,4024,3):o2net_accept_one:1800 attempt to connect from unknown node at 192.168.2.93:43868
kernel: : (o2net,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
kernel: : (o2net,3):o2net_set_nn_state:478 node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0 -> -107
I'm sure that /etc/ocfs2/cluster.conf is identical on both nodes:
/etc/ocfs2/cluster.conf
node:
    ip_port = 7777
    ip_address = 192.168.3.145
    number = 0
    name = SVR233NTC-3145.localdomain
    cluster = cpc

node:
    ip_port = 7777
    ip_address = 192.168.2.93
    number = 1
    name = SVR022-293.localdomain
    cluster = cpc

cluster:
    node_count = 2
    name = cpc
And they can connect to each other just fine:
# nc -z 192.168.3.145 7777
Connection to 192.168.3.145 7777 port [tcp/cbt] succeeded!
but the O2CB heartbeat is not active on the new node (192.168.2.93):
# /etc/init.d/o2cb status
Driver for "configfs": Loaded@R_419_6852@system "configfs": MountedDriver for "ocfs2_dlmfs": Loaded@R_419_6852@system "ocfs2_dlmfs": MountedChecking O2CB cluster cpc: OnlineHeartbeat dead threshold = 31 Network IDle timeout: 30000 Network keepalive delay: 2000 Network reconnect delay: 2000Checking O2CB heartbeat: Not active
Here's the result of running tcpdump on node 1 while starting OCFS2 on node 1:
1 0.000000 192.168.2.93 -> 192.168.3.145 TCP 70 55274 > cbt [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSval=690432180 TSecr=0
2 0.000008 192.168.3.145 -> 192.168.2.93 TCP 70 cbt > 55274 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSval=707657223 TSecr=690432180
3 0.000223 192.168.2.93 -> 192.168.3.145 TCP 66 55274 > cbt [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSval=690432181 TSecr=707657223
4 0.000286 192.168.2.93 -> 192.168.3.145 TCP 98 55274 > cbt [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=32 TSval=690432181 TSecr=707657223
5 0.000292 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
6 0.000324 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [RST, ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
The RST flag is sent on every sixth packet: the TCP handshake completes, node 1 pushes its 32-byte o2net handshake (packet 4, Len=32), and node 0 immediately resets the connection. That matches the "attempt to connect from unknown node" error in node 0's kernel log above.
What else can I do to debug this?
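(One more check that would have helped here, as it turns out below: compare what the o2nm node manager actually has registered in configfs on each node, since it can drift out of sync with /etc/ocfs2/cluster.conf. A sketch, assuming the standard o2nm attribute names:)

# dump the kernel's view of the cluster membership on each node
for n in /sys/kernel/config/cluster/cpc/node/*; do
    echo "$(basename $n): num=$(cat $n/num) ip=$(cat $n/ipv4_address):$(cat $n/ipv4_port) local=$(cat $n/local)"
done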
PS:
OCFS2 versions on node 0:
> ocfs2-tools-1.4.4-1.el5
> ocfs2-2.6.18-274.12.1.el5-1.4.7-1.el5
OCFS2 versions on node 1:
> ocfs2-tools-1.4.4-1.el5
> ocfs2-2.6.18-308.el5-1.4.7-1.el5
Update 1 – Sun Dec 23 18:15:07 ICT 2012
Are both nodes on the same lan segment? No routers etc.?
No, they are two VMware servers on different subnets.
Oh, while I remember – hostnames/DNS all set up and working correctly?
Sure, I added the hostname and IP address of each node to /etc/hosts:
192.168.2.93    SVR022-293.localdomain
192.168.3.145   SVR233NTC-3145.localdomain
and they can connect to each other via hostname:
# nc -z SVR022-293.localdomain 7777
Connection to SVR022-293.localdomain 7777 port [tcp/cbt] succeeded!

# nc -z SVR233NTC-3145.localdomain 7777
Connection to SVR233NTC-3145.localdomain 7777 port [tcp/cbt] succeeded!
Update 2 – Mon Dec 24 18:32:15 ICT 2012
Found a clue: my colleague manually edited the /etc/ocfs2/cluster.conf file while the cluster was running. So the dead node information is still kept in /sys/kernel/config/cluster/:
# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR150-4107.localdomain
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR233NTC-3145.localdomain
(SVR150-4107.localdomain in this case.)
I wanted to stop the cluster to remove the dead node, but got the following error:
# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active
I'm sure the ocfs2 service has already been stopped:
# mounted.ocfs2 -f
Device      FS     Nodes
/dev/sdb    ocfs2  Not mounted
/dev/drbd1  ocfs2  Not mounted
There are no references left:
# ocfs2_hb_ctl -I -u 12963EAF4E16484DB81ECB0251177C26
12963EAF4E16484DB81ECB0251177C26: 0 refs
I also unloaded the ocfs2 kernel module just to be sure:
# ps -ef | grep [o]cfs2
root     12513    43  0 18:25 ?        00:00:00 [ocfs2_wq]
# modprobe -r ocfs2
# ps -ef | grep [o]cfs2
# lsof | grep ocfs2
but nothing changed:
# /etc/init.d/o2cb offline
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active
So the final question is: how do I remove the dead node information without rebooting?
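(In principle this should not need a reboot: o2cb creates node objects in configfs with mkdir, so a stale node that nothing references should be removable with a plain rmdir. A sketch; it could not be verified here because the stuck heartbeat region below blocked everything:)

# remove the stale node object from the live configfs tree
rmdir /sys/kernel/config/cluster/cpc/node/SVR150-4107.localdomain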
Update 3 – Mon Dec 24 22:41:51 ICT 2012
Here are all the running heartbeat threads:
# ls -l /sys/kernel/config/cluster/cpc/heartbeat/ | grep '^d'
drwxr-xr-x 2 root root 0 Dec 24 22:18 72EF09EA3D0D4F51BDC00B47432B1EB2
The reference count of this heartbeat region:
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 7 refs
Trying to kill it:
# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
Any ideas?
Solution: Oh yeah! Problem solved. Pay attention to the UUIDs:
# mounted.ocfs2 -d
Device      FS     Stack  UUID                              Label
/dev/sdb    ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
/dev/drbd1  ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
but:
# ls -l /sys/kernel/config/cluster/cpc/heartbeat/
drwxr-xr-x 2 root root 0 Dec 24 22:53 72EF09EA3D0D4F51BDC00B47432B1EB2
This probably happened because I "accidentally" force-reformatted the OCFS2 volume. The problem I'm facing is similar to this one on the Ocfs2-users mailing list.
It is also the cause of the following error:
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
because ocfs2_hb_ctl cannot find a device with UUID 72EF09EA3D0D4F51BDC00B47432B1EB2 in /proc/partitions.
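(That failing lookup can be reproduced by hand: scan every device listed in /proc/partitions and read the OCFS2 superblock UUID from each. A sketch, assuming tunefs.ocfs2's -Q query flag as used below:)

# approximate the device-by-UUID lookup that ocfs2_hb_ctl performs
for dev in $(awk 'NR > 2 {print "/dev/" $4}' /proc/partitions); do
    uuid=$(tunefs.ocfs2 -Q "%U\n" "$dev" 2>/dev/null)
    [ "$uuid" = "72EF09EA3D0D4F51BDC00B47432B1EB2" ] && echo "$dev"
done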
Then an idea came to my mind: can I change the UUID of an OCFS2 volume?
Looking through the tunefs.ocfs2 man page:
Usage: tunefs.ocfs2 [options] <device> [new-size]
       tunefs.ocfs2 -h|--help
       tunefs.ocfs2 -V|--version
[options] can be any mix of:
        -U|--uuid-reset[=new-uuid]
So I executed the following command, passing an explicit UUID so the volume matches the stale heartbeat region instead of letting --uuid-reset generate a random one:
# tunefs.ocfs2 --uuid-reset=72EF09EA3D0D4F51BDC00B47432B1EB2 /dev/drbd1
WARNING!!! OCFS2 uses the UUID to uniquely identify a file system. Having two OCFS2 file systems with the same UUID could, in the least, cause erratic behavior, and if unlucky, cause file system damage. Please choose the UUID with care.
Update the UUID ? yes
Verify:
# tunefs.ocfs2 -Q "%U\n" /dev/drbd1
72EF09EA3D0D4F51BDC00B47432B1EB2
Trying to kill the heartbeat region again to see what happens:
# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 6 refs
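(Each -K call drops exactly one reference here, so the release cycle can be scripted; a minimal sketch, relying on the "<UUID>: N refs" output format shown above:)

UUID=72EF09EA3D0D4F51BDC00B47432B1EB2
# keep releasing references until the heartbeat region has none left
while [ "$(ocfs2_hb_ctl -I -u "$UUID" | awk '{print $2}')" -gt 0 ]; do
    ocfs2_hb_ctl -K -u "$UUID"
done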
I kept killing it until I saw 0 refs, then took the cluster offline:
# /etc/init.d/o2cb offline cpc
Stopping O2CB cluster cpc: OK
and stopped it:
# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: OK
Unloading module "ocfs2": OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK
Started it again to see whether the node information had been updated:
# /etc/init.d/o2cb start
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster cpc: OK

# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR022-293.localdomain
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR233NTC-3145.localdomain
OK, on the peer node (192.168.2.93), try to start OCFS2:
# /etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2)                [  OK  ]
Thanks to Sunil Mushran, because this thread helped me solve the problem.
The lessons are:
> IP addresses, ports, … can only be changed when the cluster is offline. See the FAQ.
> Never force-reformat an OCFS2 volume.