IPC Send timeout detected. Receiver ospid
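IPC Send timeout is a notoriously troublesome problem in Oracle 10g RAC. When resources are tight or the network is congested, an IPC send can time out; RAC then evicts the problem node, which triggers a round of reconfiguration.
Fortunately, Metalink has a patch for 10.2.0.3 that fixes this, and the problem is fixed for good in 10.2.0.4.
A typical error looks like this: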
Thu Nov 27 11:32:05 2008
IPC Send timeout detected. Receiver ospid 4001974
Thu Nov 27 11:33:08 2008
Trace dumping is performing id=[cdmp_20081127113236]
Thu Nov 27 11:34:37 2008
Errors in file /oracle/app/product/admin/srs/bdump/srs1_lms1_4001974.trc:
Thu Nov 27 11:34:38 2008
Errors in file /oracle/app/product/admin/srs/bdump/srs1_lmon_3977348.trc:
ORA-29740: evicted by member 1, group incarnation 32
Thu Nov 27 11:34:38 2008
LMON: terminating instance due to error 29740
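The bug number is Bug 5190596.
10.2.0.3 does hit this problem often, whereas it is rarely seen on 10.2.0.4.
In the logs of a multi-node cluster, IPC Send timeout detected errors show up fairly often. Sometimes the error is reported only occasionally and nothing looks abnormal at the database level. If IPC Send timeout detected is reported in large numbers over a period of time, however, an instance may be evicted.
Below is a log excerpt in which IPC Send timeout detected led to an instance being evicted: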
2019-07-09T04:00:56.552363+08:00
IPC Send timeout detected. Sender: ospid 330801 [oracle@BB001 (PING)] <=====IPC Send timeout detected; the sender is the instance's PING process
Receiver: inst 4 binc 8 ospid 129583 <=====the receiver is process ospid 129583 on instance 4
2019-07-09T04:04:10.694155+08:00
SGSGXDB1PDB(3):minact-scn: got error during useg scan e:12751 usn:11
SGSGXDB1PDB(3):minact-scn: useg scan erroring out with error e:12751
2019-07-09T04:06:20.571349+08:00
IPC Send timeout detected. Sender: ospid 331082 [oracle@BB001 (LCK0)]
Receiver: inst 4 binc 8 ospid 129871
2019-07-09T04:06:20.648844+08:00
Communications reconfiguration: instance_number 4 by ospid 331082
2019-07-09T04:06:32.085167+08:00
LMS1 (ospid: 330817) has detected no messaging activity from instance 4
USER (ospid: 330817) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2019-07-09T04:06:32.180821+08:00
LMON (ospid: 330811) drops the IMR request from LMS1 (ospid: 330817) because IMR is in progress and inst 4 is marked bad.
2019-07-09T04:06:49.447783+08:00
Evicting instance 4 from cluster <======eviction of instance 4 begins
Waiting for instances to leave: 4
2019-07-09T04:07:00.267690+08:00
Dumping diagnostic data in directory=[cdmp_20190709040056], requested by (instance=4, osid=129674 (MMON)), summary=[abnormal instance termination].
2019-07-09T04:07:08.853865+08:00
LMON received an instance eviction notification from instance 3
The instance eviction reason is 0x20000000
The instance eviction map is 4
LMON received an instance eviction notification from instance 2
The instance eviction reason is 0x20000000
The instance eviction map is 4
2019-07-09T04:07:09.508223+08:00
Remote instance kill is issued with system inc 10
Remote instance kill map (size 1) : 4
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 4
2019-07-09T04:07:12.900020+08:00
Reconfiguration started (old inc 8, new inc 12) <======reconfiguration begins
List of instances (total 3) :
1 2 3
Dead instances (total 1) :
4
My inst 1
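To see how often these timeouts occur and which instance is on the receiving end, the Sender/Receiver pairs can be pulled out of the alert log with a short script. The following is a minimal sketch, not an Oracle-supplied tool: the function name find_ipc_timeouts and the regular expressions are assumptions modeled only on the log lines quoted above, and other Oracle versions (for example the 10g format shown earlier) format these messages differently.

```python
import re

# Patterns mirror the alert-log lines quoted above; other Oracle versions
# may format these messages differently (assumption).
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
SENDER_RE = re.compile(
    r"IPC Send timeout detected\. Sender: ospid (\d+) \[\S+ \((\w+)\)\]"
)
RECEIVER_RE = re.compile(r"Receiver: inst (\d+) binc \d+ ospid (\d+)")


def find_ipc_timeouts(lines):
    """Return one dict per 'IPC Send timeout detected' message.

    Hypothetical helper: each dict carries the most recent timestamp line,
    the sender ospid and process name, and (when the next line is a
    Receiver line) the receiver instance number and ospid.
    """
    events = []
    last_timestamp = None
    awaiting_receiver = False
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP_RE.match(line):
            # Timestamp lines precede the messages they apply to.
            last_timestamp = line
            awaiting_receiver = False
            continue
        sender = SENDER_RE.search(line)
        if sender:
            events.append({
                "timestamp": last_timestamp,
                "sender_ospid": int(sender.group(1)),
                "sender_process": sender.group(2),
                "receiver_inst": None,
                "receiver_ospid": None,
            })
            awaiting_receiver = True
            continue
        if awaiting_receiver:
            # The Receiver line, if present, directly follows the Sender line.
            receiver = RECEIVER_RE.search(line)
            if receiver:
                events[-1]["receiver_inst"] = int(receiver.group(1))
                events[-1]["receiver_ospid"] = int(receiver.group(2))
            awaiting_receiver = False
    return events
```

Feeding the function the lines of an alert log (for example from open(path) on whatever file holds the log) gives a list that can be grouped by receiver_inst. A receiver instance that keeps reappearing, as instance 4 does in the excerpt above, is the one to look at first.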
Searching MOS for the IPC Send timeout detected error turns up the following documents:
IPC Send timeout/node eviction etc with high packet reassembles failure (Doc ID 2008933.1)
Troubleshooting gc block lost and Poor Network Performance in a RAC Environment (Doc ID 563566.1)
According to Doc ID 2008933.1:
While this is happening, “netstat” shows huge jump of “packet reassembles failed”:
Checking the netstat data does show a large number of packet reassembles failed errors, as follows:
[oracle@BB001 oswnetstat]$ grep -ni 'packet reassembles failed' BB001_netstat_19.07.09.0400.dat
0508971 packet reassembles failed <======a large count of packet reassembles failed while the problem is occurring
0061284 packet reassembles failed
0595953 packet reassembles failed
0939239 packet reassembles failed
0942727 packet reassembles failed
0946213 packet reassembles failed
40949641 packet reassembles failed
40953041 packet reassembles failed
40956356 packet reassembles failed
40959776 packet reassembles failed
40963302 packet reassembles failed
40966844 packet reassembles failed
40970388 packet reassembles failed
40973952 packet reassembles failed
40977485 packet reassembles failed
40981039 packet reassembles failed
40984542 packet reassembles failed
40988124 packet reassembles failed
40991680 packet reassembles failed
40995193 packet reassembles failed
40998705 packet reassembles failed
41002205 packet reassembles failed
41005754 packet reassembles failed
41009364 packet reassembles failed
41012851 packet reassembles failed
41016349 packet reassembles failed
41019855 packet reassembles failed
41023392 packet reassembles failed
41026922 packet reassembles failed
41030468 packet reassembles failed
41033983 packet reassembles failed
41037427 packet reassembles failed
41040896 packet reassembles failed
41044446 packet reassembles failed
41047890 packet reassembles failed
41051359 packet reassembles failed
41054771 packet reassembles failed
41057762