ASM 内存不足引发核心数据库故障分析与解决方案

背景概述

2月9日,某客户核心数据库因为ASM内存不足导致ASM实例开始报错。这是一个较为典型的案例,希望能给朋友们带来参考建议。

问题详细描述

某用户核心数据库在02月09日,因为数据库ASM内存不足导致ASM实例开始报错:ORA-04031 检查发现ASM的SGA_MAX_SIZE只设置了340M,该值在普通的RAC环境中使用足够,当系统繁忙或者空间压力较大的时候就会导致4031内存不足错误。经过讨论觉定增加ASM内存空间。在凌晨增加ASM空间过程中,因命中bug导致数据库没有在第一时间启动,在经过半个小时左右的排查问题后,确认bug并且调整参数后,数据库正常启动。

重大事件支持细节

故障期间,由于我们需要设置SGA_MAX_SIZE的大小,我们对ASM实例进行了如下参数的设置:

alter system set SGA_MAX_SIZE=4G scope=spfile sid='*';

在设置该值后,我们对二节点集群进行重启,重启过程中,集群无法启动,报错如下:
集群alert日志

   2017-02-10 01:56:29.990
[/oracle/grid/crs_1/bin/orarootagent.bin(6750520)]CRS-5018:(:CLSN00037:) Removed unused HAIP route:  169.254.255.255 / 255.255.0.0 / 169.254.114.35 / en9
2017-02-10 01:56:32.018
[ctssd(17432812)]CRS-2409:The clock on host newcbs2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2017-02-10 01:56:48.047
[/oracle/grid/crs_1/bin/oraagent.bin(20906100)]CRS-5019:All OCR locations are on ASM disk groups [], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/oracle/grid/crs_1/log/newcbs2/agent/ohasd/oraagent_grid/o
raagent_grid.log".
2017-02-10 01:57:09.778
[/oracle/grid/crs_1/bin/oraagent.bin(20906100)]CRS-5019:All OCR locations are on ASM disk groups [], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/oracle/grid/crs_1/log/newcbs2/agent/ohasd/oraagent_grid/o
raagent_grid.log".
2017-02-10 01:57:31.802
[/oracle/grid/crs_1/bin/oraagent.bin(20906100)]CRS-5019:All OCR locations are on ASM disk groups [], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/oracle/grid/crs_1/log/newcbs2/agent/ohasd/oraagent_grid/o
raagent_grid.log".
2017-02-10 01:57:34.667
[ohasd(12845356)]CRS-2807:Resource 'ora.asm' failed to start automatically.
2017-02-10 01:57:34.667
[ohasd(12845356)]CRS-2807:Resource 'ora.crsd' failed to start automatically.
集群ohasd日志:
2017-02-10 01:56:54.738: [UiServer][7454] {0:0:96} Done for ctx=1138dc7b0
2017-02-10 01:57:10.978: [    AGFW][5912] {0:0:2} Received the reply to the message: RESOURCE_START[ora.asm 1 1] ID 4098:411 from the agent /oracle/grid/crs_1/bin/oraagent_grid
2017-02-10 01:57:10.979: [    AGFW][5912] {0:0:2} Agfw Proxy Server sending the reply to PE for message:RESOURCE_START[ora.asm 1 1] ID 4098:410
2017-02-10 01:57:10.979: [   CRSPE][7197] {0:0:2} Received reply to action [Start] message ID: 410
2017-02-10 01:57:10.979: [   CRSPE][7197] {0:0:2} Got agent-specific msg: CRS-5017: The resource action "ora.asm start" encountered the following error: 
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","ALTER DISKGROUP ALL MOUNT /*...","SQLA","tmp")
. For details refer to "(:CLSN00107:)" in "/oracle/grid/crs_1/log/newcbs2/agent/ohasd/oraagent_grid/oraagent_grid.log".

2017-02-10 01:57:10.979: [    AGFW][5912] {0:0:2} Received the reply to the message: RESOURCE_START[ora.asm 1 1] ID 4098:411 from the agent /oracle/grid/crs_1/bin/oraagent_grid
2017-02-10 01:57:10.980: [    AGFW][5912] {0:0:2} Agfw Proxy Server sending the reply to PE for message:RESOURCE_START[ora.asm 1 1] ID 4098:410
2017-02-10 01:57:10.980: [   CRSPE][7197] {0:0:2} Received reply to action [Start] message ID: 410
2017-02-10 01:57:10.980: [   CRSPE][7197] {0:0:2} Start action failed with error code: 2
2017-02-10 01:57:11.145: [    AGFW][5912] {0:0:2} Received the reply to the message: RESOURCE_START[ora.asm 1 1] ID 4098:411 from the agent /oracle/grid/crs_1/bin/oraagent_grid
2017-02-10 01:57:11.146: [    AGFW][5912] {0:0:2} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_START[ora.asm 1 1] ID 4098:410
2017-02-10 01:57:11.146: [   CRSPE][7197] {0:0:2} Received reply to action [Start] message ID: 410
2017-02-10 01:57:11.146: [   CRSPE][7197] {0:0:2} RI [ora.asm 1 1] new internal state: [STABLE] old value: [STARTING]
2017-02-10 01:57:11.146: [   CRSPE][7197] {0:0:2} CRS-2674: Start of 'ora.asm' on 'newcbs2' failed
2017-02-10 01:57:11.147: [   CRSPE][7197] {0:0:2} RI [ora.asm 1 1] new internal state: [CLEANING] old value: [STABLE]
2017-02-10 01:57:11.147: [   CRSPE][7197] {0:0:2} Sending message to agfw: id = 417
2017-02-10 01:57:11.148: [   CRSPE][7197] {0:0:2} CRS-2679: Attempting to clean 'ora.asm' on 'newcbs2'

在一节点上我们发现同样的报错,从报错信息上看,很明确的可以定位到ASM实例在启动过程中依旧在报ORA-04031错误,但是ASM实例的内存大小已经被我们设置成32G,足够ASM使用。
我们将ASM的参数文件取出确认参数设置:

*._asm_hbeatiowait=200
+ASM1.asm_diskgroups='DATA'#Manual Mount
+ASM2.asm_diskgroups='DATA'#Manual Mount
*.asm_power_limit=1
*.diagnostic_dest='/oracle/gridbase'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_max_size=4294967296

可以看到ASM的SGA值明确设置到了4G,在经过大量尝试后,asm启动依旧报ORA-04031错误。
由于目标数据库为11.2.0.3版本,我们尝试用MEMORY_TARGET值来控制数据库ASM内存使用,发现在使用MEMORY_TARGET后,数据库可以正常启动,猜测可能存在相关的bug。由于业务紧急,我们将数据库正常启动。

结论及解决方案

首先,该文档只作为故障解释文档,并不能作为完全的故障定性文档,相关的bug还需要进一步的确认,我们只是希望自第一时间先给出具体的故障过程,目前来看,针对该bug 可使用的解决方案只用调整MEMORY_TARGET值可解决该问题。
hhh6.jpg

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值