对于性能问题或者一些比较奇怪妖异的问题,有很多点可以着手去分析。
准备写一个系列关于用ash/dba_hist_active_sess_history,用oradebug,用linux命令strace,pstack或者用等等等等工具~~来归纳下一些思路,就是当目前为止所有分析的结果都没任何头绪的时候,接下来如何进行深入的troubleshooting。
比如当一个sql执行的很慢的时候,看看sql历史等待事件,看sql执行计划,但要是没什么特别明显的等待,或者执行计划看上去可以,感觉这个sql应该执行正常,为何执行的这么慢呢?那么可以用下oradebug来分析下进程的call stack,或者做这个session的state dump也看出点什么东西,从而分析出正确的结果以及root cause,当然对于session的历史等待时间ASH也是值得去深究的,执行计划更是一个很大的话题,所以这个系列也主要是为了自己在以后的工作中遇到类似的问题,或者没思路的时候,可以有些眉目~~
==============================================
首先写一个利用pstack和strace分析goldengate issue当做系列的第一篇~
这里就是数据库里没啥问题,goldengate也看不出什么毛病,感觉无从下手了~灵光一现bling bling的就解决问题了~
----------------------------------START--------------------------------------------------------
收到一个告警:
延迟了3hour+,登到ggsci里面去看
GGSCI (ora-bi-p-5.va2.b2c.nike.com) 2> !
info r4_ot
REPLICAT R4_OT Last Started 2016-11-02 05:23 Status RUNNING
INTEGRATED
Checkpoint Lag 00:00:00 (updated 00:00:04 ago)
Process ID 27164
Log Read Checkpoint File /gg/app/ggadmin/product/12.1.2.1.0/dirdat/replicate/pdot/r4000020
2016-11-02 05:15:15.156360 RBA 535238535
(lag r4_ot的输出也显示没lag)
没lag呀,但是数据库里面信息却说有lag,那会不是是进程hang住了,先去数据库里面看下apply process的状态
select apply_name,null delay_min from dba_apply where status<>'ENABLED'
union
select capture_name,null from dba_capture where status<>'ENABLED'
union
select ac.apply_name, round((sysdate-applied_message_create_time)*1440)
from dba_apply_progress ap,GV$STREAMS_APPLY_COORDINATOR ac
where ac.apply_name=ap.apply_name
;
APPLY_NAME DELAY_MIN
------------------------------ ----------
OGG$R4_OT 247
于是看看ggserr.log ,view report等等都没看出啥来,因为没有什么报错呀!!!
这个时候怎么办呢,数据库说是apply process已经lag三个小时了,但是goldengate里面没看到任何的error log,
当然自己尝试了多次的info r4_ot,rba并没有改变,或许源端没事务在跑所以r4_ot rba不变动也是有这可能的~
突然想到来分析下进程状态,是不是在wait什么。
先进到数据库里面没看到ggadmin user的等待事件啥的,但是没有啊,所以现在的状态就是ogg也没报错,也显示没lag,在数据库里也没有相应的事务在跑,那么从何下手呢?
~~~~~~~~~~~~~~~~~~~~~~~~~
于是接下来想到去做个pstack来追踪下ogg r4_ot spid的状态
ggadmin@ora-bi-p-5:DCBIPRD5:/gg/app/ggadmin/product/12.1.2.1.0/dirdat/replicate/pdot$pstack 27164
Thread 5 (Thread 0x7f716ae20700 (LWP 27165)):
#0 0x00000038c700b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000692e91 in ggs::gglib::MultiThreading::Event::Wait(unsigned int) ()
#2 0x00007f717105a69c in ggs::gglib::gglog::LoggingTime::TimeThread(void*) () from /gg/app/ggadmin/product/12.1.2.1.0/libgglog.so
#3 0x00000000006908e4 in ggs::gglib::MultiThreading::Thread::RunThread(ggs::gglib::MultiThreading::Thread::ThreadArgs*) ()
#4 0x00000038c7007aa1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000038c6ce893d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f716a61f700 (LWP 27166)):
#0 0x00000038c700b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000692e91 in ggs::gglib::MultiThreading::Event::Wait(unsigned int) ()
#2 0x00007f71710c0f16 in ggs::gglib::gglog::LogBufferImpl::PublisherThread(void*) () from /gg/app/ggadmin/product/12.1.2.1.0/libgglog.so
#3 0x00000000006908e4 in ggs::gglib::MultiThreading::Thread::RunThread(ggs::gglib::MultiThreading::Thread::ThreadArgs*) ()
#4 0x00000038c7007aa1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000038c6ce893d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f71694e3700 (LWP 27167)):
#0 0x00000038c700ba5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000692e4d in ggs::gglib::MultiThreading::Event::Wait(unsigned int) ()
#2 0x00007f71710bcbaf in ggs::gglib::gglog::DOMConfiguratorImpl::watchThread(void*) () from /gg/app/ggadmin/product/12.1.2.1.0/libgglog.so
#3 0x00000000006908e4 in ggs::gglib::MultiThreading::Thread::RunThread(ggs::gglib::MultiThreading::Thread::ThreadArgs*) ()
#4 0x00000038c7007aa1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000038c6ce893d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f7163fff700 (LWP 27168)):
#0 0x00000038c700ba5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000692e4d in ggs::gglib::MultiThreading::Event::Wait(unsigned int) ()
#2 0x00007f71710a3e1a in CMessageFactoryImpl::RepetitionThread(void*) () from /gg/app/ggadmin/product/12.1.2.1.0/libgglog.so
#3 0x00000000006908e4 in ggs::gglib::MultiThreading::Thread::RunThread(ggs::gglib::MultiThreading::Thread::ThreadArgs*) ()
#4 0x00000038c7007aa1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000038c6ce893d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f716ae23a20 (LWP 27164)):
#0 0x00000038c700f00d in nanosleep () from /lib64/libpthread.so.0
#1 0x0000000000638330 in ggDelayMsecs(long) ()
#2 0x000000000057110e in wait_for_more(short) ()
#3 0x00000000005a539d in process_replicat_loop() ()
#4 0x00000000005bd51a in replicat_main(int, char**) ()
#5 0x00000000006905af in ggs::gglib::MultiThreading::MainThread::ExecMain() ()
#6 0x00000000006908e4 in ggs::gglib::MultiThreading::Thread::RunThread(ggs::gglib::MultiThreading::Thread::ThreadArgs*) ()
#7 0x00000000006909eb in ggs::gglib::MultiThreading::MainThread::Run(int, char**) ()
#8 0x00000000005bc1af in main ()
很明显可以看到进程在wait!!!!
再做个strace深入看看
Process 27164 attached
restart_syscall(<... resuming interrupted call ...>) = 0
futex(0x281ca64, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x281ca60, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x281ca38, FUTEX_WAKE_PRIVATE, 1) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
select(7, [6], NULL, NULL, {0, 0}) = 0 (Timeout)
lseek(14, 534773760, SEEK_SET) = 534773760
read(14, "Details</b></p><li>Flybeam struc"..., 1048576) = 477287
read(14, "", 1048576) = 0
lseek(14, 0, SEEK_SET) = 0
read(14, "F\0\7f0\0\3\2430\0\0\10GG\r\nTL\n\r1\0\0\2\0\0042\0\0\4 \0"..., 1048576) = 1048576
lseek(14, 1048576, SEEK_SET) = 1048576
lseek(14, 534773760, SEEK_SET) = 534773760
read(14, "Details</b></p><li>Flybeam struc"..., 1048576) = 477287
stat("/gg/app/ggadmin/product/12.1.2.1.0/dirdat/replicate/pdot/r4000021", 0x7ffcac69c610) = -1 ENOENT (No such file or directory)
nanosleep({1, 0}, NULL) = 0
-------------------------------------------
看到了stat("/gg/app/ggadmin/product/12.1.2.1.0/dirdat/replicate/pdot/r4000021", 0x7ffcac69c610) = -1 ENOENT (No such file or directory)
一目了然,下一个r4_ot要读的trail没找到
看看trail有哪些
GGSCI (ora-bi-p-5.va2.b2c.nike.com) 2> info r4_ot
REPLICAT R4_OT Last Started 2016-11-02 08:43 Status RUNNING
INTEGRATED
Checkpoint Lag 00:00:00 (updated 00:00:02 ago)
Process ID 7212
Log Read Checkpoint File /gg/app/ggadmin/product/12.1.2.1.0/dirdat/replicate/pdot/r4000020
2016-11-02 08:51:11.956186 RBA 535238535
GGSCI (ora-bi-p-5.va2.b2c.nike.com) 3> sh ls -lrt /gg/app/ggadmin/product/12.1.2.1.0/dirdat/replicate/pdot/r4*
-rw-r-----. 1 ggadmin oinstall 1023999938 Oct 31 16:41 /gg/app/ggadmin/product/12.1.2.1.0/dirdat/replicate/pdot/r4000018
-rw-r-----. 1 ggadmin oinstall 1023999769 Nov 1 18:53 /gg/app/ggadmin/product/12.1.2.1.0/dirdat/replicate/pdot/r4000019
-rw-r-----. 1 ggadmin oinstall 535251047 Nov 2 03:57 /gg/app/ggadmin/product/12.1.2.1.0/dirdat/replicate/pdot/r4000020
-rw-r-----. 1 ggadmin oinstall 204917668 Nov 2 05:23 /gg/app/ggadmin/product/12.1.2.1.0/dirdat/replicate/pdot/r4000346
-rw-r-----. 1 ggadmin oinstall 1264 Nov 2 05:23 /gg/app/ggadmin/product/12.1.2.1.0/dirdat/replicate/pdot/r4000347
-rw-r-----. 1 ggadmin oinstall 59060543 Nov 2 08:49 /gg/app/ggadmin/product/12.1.2.1.0/dirdat/replicate/pdot/r4000348
很明显序号seqno就不连续了,为什么会出现这种状况(先不考虑了,先修好)
于是alter R4_OT extseqno 0346 extrba 0
当然后面ogg alter之后还出现了一些问题,也顺利解决了,这里就略过不谈了。
最主要的其实还是想分享下这个思考的过程:
收到告警-》ogg没看出问题-》数据库的ogg user也没在跑r4_ot相关的事务,但就是apply_process有lag-》通过strace/pstack顺利troubleshooting到问题原因,从而解决问题。
Keep thinking~~
=============ENDED============================
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)