oracle报03113,【案例】Oracle报错ORA-03113 ORA-15064产生原因和解决办法

本文详细分析了一个Oracle RAC数据库实例遇到ORA-03113和ORA-15064错误的情况,原因是数据库实例的asmb进程崩溃,导致数据库实例被CRS重启。通过对数据库和ASM实例的日志、trace文件及diag进程的分析,发现asmb进程等待时间过长,但未找到明确的崩溃原因。最后推测可能是Oracle的bug或系统负载过高引起的问题,建议考虑升级或进一步排查。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

【案例】Oracle报错ORA-03113 ORA-15064产生原因和解决办法

时间:2016-12-11 18:47   来源:Oracle研究中心   作者:网络   点击:

天萃荷净

Oracle研究中心案例分析:运维DBA反映Oracle RAC数据库报错ORA-03113 ORA-15064,分析原因为数据库实例的asmb进程crash,然后导致数据库实例crash,进而被crs给重启了。

本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

本文链接地址: 10gR2 rac(asm) crash with ora-15064

以前同事发过来一个rac log,让帮忙分析一下,该问题2011年我也看过,当时没有详细分析,这次我们再来看看,能否找到根本的原因,首先我们来看database的alert log:

Mon Sep 17 09:59:37 2012

Thread 1 advanced to log sequence 8631

Current log# 1 seq# 8631 mem# 0: +DG_DATA/cmsdb3/onlinelog/redo01_1.log

Current log# 1 seq# 8631 mem# 1: +DG_DATA/cmsdb3/onlinelog/redo01_2.log

Mon Sep 17 10:35:00 2012

Errors in file /oracle/admin/cmsdb3/bdump/cmsdb31_asmb_1155.trc:

ORA-15064: communication failure with ASM instance

ORA-03113: end-of-file on communication channel

Mon Sep 17 10:35:00 2012

ASMB: terminating instance due to error 15064

Mon Sep 17 10:35:00 2012

System state dump is made for local instance

System State dumped to trace file /oracle/admin/cmsdb3/bdump/cmsdb31_diag_639.trc

Mon Sep 17 10:35:02 2012

Shutting down instance (abort)

License high water mark = 177

Mon Sep 17 10:35:05 2012

Instance terminated by ASMB, pid = 1155

Mon Sep 17 10:35:07 2012

Instance terminated by USER, pid = 106

Mon Sep 17 10:35:16 2012

Starting ORACLE instance (normal)

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Interface type 1 ce1 192.168.0.0 configured from OCR for use as a cluster interconnect

Interface type 1 ce0 172.16.0.0 configured from OCR for use as  a public interface

Picked latch-free SCN scheme 3

WARNING: db_recovery_file_dest is same as db_create_file_dest

Autotune of undo retention is turned on.

LICENSE_MAX_USERS = 0

SYS auditing is disabled

ksdpec: called for event 13740 prior to event group initialization

Starting up ORACLE RDBMS Version: 10.2.0.3.0.

System parameters with non-default values:

processes                = 723

sessions                 = 800

从上面错误可以清楚的看到,在时间点Mon Sep 17 10:35:00 2012 出现故障,数据Oracleо库实例的asmb进程出现故障,导致数据库实例无法跟asm进行交互。

这里补充一下,数据库实例和asm实例都存在asmb进程,我猜测应该是数据库实例的asmb进程和asm实例的asmb进程进行交互,只要其中一个asmb进程出现故障,那么可能都会导致数据库crash。

我们来看上面提到的trace cmsdb31_asmb_1155.trc内容,如下:

/oracle/admin/cmsdb3/bdump/cmsdb31_asmb_1155.trc

Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production

With the Partitioning, Real Application Clusters, OLAP and Data Mining options

ORACLE_HOME = /oracle/product/10.2.0/cmsdb

System name: SunOS

Node name: cmsdb05

Release: 5.10

Version: Generic_137111-01

Machine: sun4u

Instance name: cmsdb31

Redo thread mounted by this instance: 1

Oracle process number: 52

Unix process pid: 1155, image: oracle@cmsdb05 (ASMB)

*** 2012-09-17 10:35:00.281

*** SERVICE NAME:(SYS$BACKGROUND) 2012-09-17 10:35:00.280

*** SESSION ID:(586.1) 2012-09-17 10:35:00.280

error 15064 detected in background process

ORA-15064: communication failure with ASM instance

ORA-03113: end-of-file on communication channel

ksuitm: waiting for [5] seconds before killing DIAG

我们看到这个trace基本上没有什么信息,就只有ora-15064和ora-03113错误,这2个错误很好理解,不多说。

后面还有一句ksuitm: waiting for [5] seconds before killing DIAG 其实也值得我们关注。

这个oracle 10g中是通过一个隐含参数来进行控制的:

SQL> show parameter ksu_diag

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

_ksu_diag_kill_time                  integer     5

该参数默认是5s,也就上面错误提到的5s,下面是关于该参数的描述:

number of seconds ksuitm waits before killing diag

这里解释下ksu是什么意思:kernel service user 的简称。所以这里不难猜测ksuitm是一个函数,大概是下面的简称:

kernel service user diag wait time。

也就是说diag进程默认等待超过5s,就会被kill掉。

接着我们来继续看asm实例的alert log,如下:

Mon Sep 17 10:35:15 2012

Starting background process ASMB

ASMB started with pid=22, OS id=599

Mon Sep 17 10:38:21 2012

NOTE: ASMB process exiting due to lack of ASM file activity

从上面信息可以看到,在Mon Sep 17 10:35:15 2012,asm实例启动了asmb进程,这里说明一下,asm实例的asmb进程是需要进行数据交互时,才会启动,否则不会启动,我通过vm asm环境测试发现的。

另外通过分析asm的pmon trace,发现大量如下信息:

/oracle/admin/+ASM/bdump/+asm1_pmon_18359.trc

Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production

With the Partitioning, Real Application Clusters, OLAP and Data Mining options

ORACLE_HOME = /oracle/product/10.2.0/cmsdb

System name: SunOS

Node name: cmsdb05

Release: 5.10

Version: Generic_137111-01

Machine: sun4u

Instance name: +ASM1

Redo thread mounted by this instance: 0

Oracle process number: 2

Unix process pid: 18359, image: oracle@cmsdb05 (PMON)

*** 2012-09-10 13:55:18.344

*** SERVICE NAME:() 2012-09-10 13:55:18.343

*** SESSION ID:(49.1) 2012-09-10 13:55:18.343

[claim lock for dead process][lp 0x3877a7308][p 0x3877265d0.0][hist x951]

[claim lock for dead process][lp 0x3877a71a0][p 0x3877265d0.0][hist x951]

[claim lock for dead process][lp 0x3877a7050][p 0x3877265d0.0][hist x951]

[claim lock for dead process][lp 0x3877a6f00][p 0x3877265d0.0][hist x951]

[claim lock for dead process][lp 0x3877a6db0][p 0x3877265d0.0][hist x951]

[claim lock for dead process][lp 0x3877a6c60][p 0x3877265d0.0][hist x951]

[claim lock for dead process][lp 0x3877a6b10][p 0x3877265d0.0][hist x951]

通过查询mos发现,这个是一个bug,不过无关紧要,如下描述:

Cause

This is an informational message.

The meaning of "claim lock for dead process" message:

The message "claim lock for dead process" indicates that PMON is cleaning up the dead process and releasing the

locks held by this process.These messages do not correlate to any instance termination.

This message will be logged when sessions/processes are killed manually and PMON is clearing the lock held by

the dead process.

Solution

This is a regular massage and there is no ways to remove or suppress it.

Database and System Administrator should focus on finding why the processes are dying.

If that is expected then they just can ignore the messages included in the PMON trace file.

从前面的信息,我们可以发现,都没有什么具体价值的东西,下面我们来看看diag的trace,看看能不能找到一些蛛丝马迹。

搜索asmb,找到如下信息:

PROCESS 52:

----------------------------------------

SO: 680005f58, type: 2, owner: 0, flag: INIT/-/-/0x00

(process) Oracle pid=52, calls cur/top: 68015ba68/68015ba68, flag: (6) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 33

last post received-location: ksrpublish

last process to post me: 741001830 5 0

last post sent: 818 0 4

last post sent-location: kslpsr

last process posted by me: 6800037d0 1 6

(latch info) wait_event=0 bits=0

Process Group: DEFAULT, pseudo proc: 5373ab390

O/S info: user: oracle, term: UNKNOWN, ospid: 1155

OSD pid info: Unix process pid: 1155, image: oracle@cmsdb05 (ASMB)

Dump of memory from 0x0000000534329A58 to 0x0000000534329C60

534329A50                   00000005 00000000          [........]

534329A60 00000006 801CFF08 00000010 0003139D  [................]

534329A70 00000006 8015BA68 00000003 0003139D  [.......h........]

534329A80 00000005 3643C6B0 00000013 00031291  [....6C..........]

534329A90 00000006 801BFD58 0000000B 0003139D  [.......X........]

534329AA0 00000006 801482C8 00000004 00031291  [................]

534329AB0 00000000 00000000 00000000 00000000  [................]

Repeat 26 times

----------------------------------------

SO: 6801482c8, type: 4, owner: 680005f58, flag: INIT/-/-/0x00

(session) sid: 586 trans: 0, creator: 680005f58, flag: (51) USR/- BSY/-/-/-/-/-

DID: 0000-0034-00000003, short-term DID: 0000-0000-00000000

txn branch: 0

oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS

last wait for 'ASM background timer' blocking sess=0x0 seq=3 wait_time=1310127 seconds since wait started=592766

=0, =0, =0

Dumping Session Wait History

for 'ASM background timer' count=1 wait_time=1310127

=0, =0, =0

for 'ASM background timer' count=1 wait_time=4892509

=0, =0, =0

for 'ASM background timer' count=1 wait_time=4892503

=0, =0, =0

for 'ASM background timer' count=1 wait_time=4892462

=0, =0, =0

for 'ASM background timer' count=1 wait_time=4892512

=0, =0, =0

for 'ASM background timer' count=1 wait_time=4892454

=0, =0, =0

for 'ASM background timer' count=1 wait_time=4892504

=0, =0, =0

for 'ASM background timer' count=1 wait_time=4892602

=0, =0, =0

for 'ASM background timer' count=1 wait_time=4892541

=0, =0, =0

for 'ASM background timer' count=1 wait_time=4892473

=0, =0, =0

temporary object counter: 0

----------------------------------------

UOL used : 0 locks(used=0, free=0)

KGX Atomic Operation Log 51df73dc0

Mutex 0(0, 0) idn 0 oper NONE

Library Cache uid 586 efd 0 whr 0 slp 0

KGX Atomic Operation Log 51df73e08

Mutex 0(0, 0) idn 0 oper NONE

Library Cache uid 586 efd 0 whr 0 slp 0

KGX Atomic Operation Log 51df73e50

Mutex 0(0, 0) idn 0 oper NONE

Library Cache uid 586 efd 0 whr 0 slp 0

----------------------------------------

SO: 6812679b8, type: 41, owner: 6801482c8, flag: INIT/-/-/0x00

(dummy) nxc=0, nlb=0

----------------------------------------

SO: 6801bfd58, type: 11, owner: 680005f58, flag: INIT/-/-/0x00

(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 680005f58,

event: 24, last message event: 24,

last message waited event: 24, messages read: 0

channel: (5363f4a98) system events broadcast channel

scope: 2, event: 8814, last mesage event: 33,

publishers/subscribers: 0/201,

messages published: 1

----------------------------------------

SO: 53643c6b0, type: 19, owner: 680005f58, flag: INIT/-/-/0x00

GES MSG BUFFERS: st=emp chunk=0x0 hdr=0x0 lnk=0x0 flags=0x0 inc=0

outq=0 sndq=0 opid=0 prmb=0x0

mbg[i]=(0 0) mbg[b]=(0 0) mbg[r]=(0 0)

fmq[i]=(0 0) fmq[b]=(0 0) fmq[r]=(0 0)

mop[s]=0 mop[q]=0 pendq=0 zmbq=0

------------process 0x53643c6b0--------------------

proc version      : 0

Local node        : 0

pid               : 1155

lkp_node          : 0

svr_mode          : 0

proc state        : KJP_FROZEN

Last drm hb acked : 0

Total accesses    : 1

Imm.  accesses    : 0

Locks on ASTQ     : 0

Locks Pending AST : 0

Granted locks     : 0

AST_Q:

PENDING_Q:

GRANTED_Q:

----------------------------------------

SO: 68015ba68, type: 3, owner: 680005f58, flag: INIT/-/-/0x00

(call) sess: cur 6801482c8, rec 0, usr 6801482c8; depth: 0

----------------------------------------

SO: 6812a5b90, type: 84, owner: 68015ba68, flag: INIT/-/-/0x00

(kfgso) flags: 00000000 clt: 3 err: 0 hint: 0

(kfgpn) rpi: 1 itrn:0 gst:0 usrp:0

busy: 0 rep: 0 grp: 0 check: 0/0 glink: 812a5c10 812a5c10

----------------------------------------

SO: 6801cff08, type: 16, owner: 680005f58, flag: INIT/-/-/0x00

(osp req holder)

似乎没有什么异常,ASM background timer是一个idle event,虽然wait time挺长的,但是这里是一个累积值。

另外看下了diag进程的信息,也没有什么特别的,如下:

PROCESS 8:

----------------------------------------

SO: 680000860, type: 2, owner: 0, flag: INIT/-/-/0x00

(process) Oracle pid=8, calls cur/top: 68015bfe8/68015bfe8, flag: (2) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 0

last post received-location: No post

last process to post me: none

last post sent: 0 0 48

last post sent-location: ksoreq_reply

last process posted by me: 800058f98 1 0

(latch info) wait_event=0 bits=0

Process Group: DEFAULT, pseudo proc: 5373ab390

O/S info: user: oracle, term: UNKNOWN, ospid: 639

OSD pid info: Unix process pid: 639, image: oracle@cmsdb05 (DIAG)

Dump of memory from 0x0000000534328400 to 0x0000000534328608

534328400 00000005 00000000 00000006 801D0018  [................]

534328410 00000010 0003139D 00000006 8015BFE8  [................]

534328420 00000003 0003139D 00000006 801C0088  [................]

534328430 0000000B 0003139D 00000006 80158FB8  [................]

534328440 00000004 00031291 00000005 3643D520  [............6C. ]

534328450 00000013 00031291 00000000 00000000  [................]

534328460 00000000 00000000 00000000 00000000  [................]

Repeat 25 times

534328600 00000000 00000000                    [........]

----------------------------------------

SO: 53643d520, type: 19, owner: 680000860, flag: INIT/-/-/0x00

GES MSG BUFFERS: st=emp chunk=0x0 hdr=0x0 lnk=0x0 flags=0x0 inc=0

outq=0 sndq=0 opid=0 prmb=0x0

mbg[i]=(0 0) mbg[b]=(0 0) mbg[r]=(0 0)

fmq[i]=(0 0) fmq[b]=(0 0) fmq[r]=(0 0)

mop[s]=0 mop[q]=0 pendq=0 zmbq=0

------------process 0x53643d520--------------------

proc version      : 0

Local node        : 0

pid               : 639

lkp_node          : 0

svr_mode          : 0

proc state        : KJP_FROZEN

Last drm hb acked : 0

Total accesses    : 1

Imm.  accesses    : 0

Locks on ASTQ     : 0

Locks Pending AST : 0

Granted locks     : 0

AST_Q:

PENDING_Q:

GRANTED_Q:

----------------------------------------

SO: 680158fb8, type: 4, owner: 680000860, flag: INIT/-/-/0x00

(session) sid: 599 trans: 0, creator: 680000860, flag: (51) USR/- BSY/-/-/-/-/-

DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000

txn branch: 0

oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS

last wait for 'DIAG idle wait' blocking sess=0x0 seq=3 wait_time=136913 seconds since wait started=592763

component=1, where=1, wait time(millisec)=c8

Dumping Session Wait History

for 'DIAG idle wait' count=1 wait_time=136913

component=1, where=1, wait time(millisec)=c8

for 'DIAG idle wait' count=1 wait_time=309499

component=1, where=1, wait time(millisec)=c8

for 'DIAG idle wait' count=1 wait_time=205062

component=1, where=1, wait time(millisec)=c8

for 'DIAG idle wait' count=1 wait_time=205050

component=1, where=1, wait time(millisec)=c8

for 'DIAG idle wait' count=1 wait_time=204984

component=1, where=1, wait time(millisec)=c8

for 'DIAG idle wait' count=1 wait_time=205079

component=1, where=1, wait time(millisec)=c8

for 'DIAG idle wait' count=1 wait_time=205032

component=1, where=1, wait time(millisec)=c8

for 'DIAG idle wait' count=1 wait_time=205052

component=1, where=1, wait time(millisec)=c8

for 'DIAG idle wait' count=1 wait_time=205057

component=1, where=1, wait time(millisec)=c8

for 'DIAG idle wait' count=1 wait_time=204994

component=1, where=1, wait time(millisec)=c8

temporary object counter: 0

----------------------------------------

UOL used : 0 locks(used=0, free=0)

KGX Atomic Operation Log 52cffe850

Mutex 0(0, 0) idn 0 oper NONE

Library Cache uid 599 efd 0 whr 0 slp 0

KGX Atomic Operation Log 52cffe898

Mutex 0(0, 0) idn 0 oper NONE

Library Cache uid 599 efd 0 whr 0 slp 0

KGX Atomic Operation Log 52cffe8e0

Mutex 0(0, 0) idn 0 oper NONE

Library Cache uid 599 efd 0 whr 0 slp 0

----------------------------------------

SO: 681267698, type: 41, owner: 680158fb8, flag: INIT/-/-/0x00

(dummy) nxc=0, nlb=0

----------------------------------------

SO: 68015bfe8, type: 3, owner: 680000860, flag: INIT/-/-/0x00

(call) sess: cur 680158fb8, rec 0, usr 680158fb8; depth: 0

----------------------------------------

SO: 6801d0018, type: 16, owner: 680000860, flag: INIT/-/-/0x00

(osp req holder)

可以看到diag 的等待时间确实挺长的,以至于该进程被kill。

通过strace 跟踪asm和数据库实例的asmb进程发现都是如下类似的几个函数,没有别的:

% time     seconds  usecs/call     calls    errors syscall

------ ----------- ----------- --------- --------- ----------------

nan    0.000000           0        22           read

nan    0.000000           0        22           write

nan    0.000000           0        66           gettimeofday

------ ----------- ----------- --------- --------- ----------------

100.00    0.000000                   110           total

但是通过查看/proc/pid/fd/下面的信息,发现应该就是asmb 2个进程进行数据交互的:

--datbase instance

lr-x------ 1 oracle oinstall 64 Sep 26 08:35 23 -> pipe:[141579]

l-wx------ 1 oracle oinstall 64 Sep 26 08:35 22 -> pipe:[141578]

lrwx------ 1 oracle oinstall 64 Sep 26 08:35 20 -> socket:[141577]

---asm instance

lr-x------ 1 oracle oinstall 64 Sep 26 08:40 22 -> pipe:[142437]

l-wx------ 1 oracle oinstall 64 Sep 26 08:40 21 -> pipe:[142436]

lr-x------ 1 oracle oinstall 64 Sep 26 08:40 2 -> /dev/null

lrwx------ 1 oracle oinstall 64 Sep 26 08:40 19 -> socket:[142435]

不过至于具体是怎么交互,就不清楚了。

从diag trace看也没有发现任何的死进程,如下:

Orapids on dead process list: [count = 0]

综上所掌握的信息来看,就是数据库实例的asmb进程crash,然后导致数据库实例crash,进而被crs给重启了。

到最后我们都没有搞清楚数据库实例的asmb进程为何会crash 掉。 从数据库alert log看,故障前6天都是正常的,alert中只有redo切换信息,而且并不频繁,所以最后我不得不再次怀疑是oracle bug了。

我记得该客户除了这3套10.2.0.3 rac(asm)外,还有另外2套10.2.0.4、10.2.0.5的asm rac,几乎从来没有出现过问题。

最后还是没有找到根本原因,不知道是不是应该直接建议该客户直接升级得了。

mos有个bug 有点沾边,虽然是说的11.1.0.7,但是以下版本可能也会受影响,但是关于该bug的具体信息看不到,比较遗憾。

Bug 9246245 – Instance crash due to ORA-15064 in RAC [ID 9246245.8]

另外一种猜测就是有可能当前系统负载过高,可能触发什么bug之类的,不过这些都是猜测,总的来说,bug的可能性是非常之大的。

--------------------------------------ORACLE-DBA----------------------------------------

最权威、专业的Oracle案例资源汇总之【案例】Oracle报错ORA-03113 ORA-15064产生原因和解决办法

9bd101509341196819122f36086c9a60.png

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值