Notes on a Database Outage Caused by Linux Memory Exhaustion

Original article, last modified 2024-10-06 20:41:44

License: CC 4.0 BY-SA

Tags: #linux #database #operations #server #oracle #dba

First published 2024-09-06 16:50:21

Part of the column: DB and Server Trouble-Shooting

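I. How the failure unfolded

1. Alert raised: at 2024.09.02 09:04:48 the alert Linux: High memory utilization (>90% for 5m) fired.
Checking it showed the current value at 90.96 %.
2. top showed memory usage at 91%, and shortly afterwards 95%.

3. Looking for the cause: free -h showed buff/cache taking up too much memory, with free close to 0 and available close to 0.
4. Temporarily forced the buff/cache memory to be released: sync; echo 3 > /proc/sys/vm/drop_caches (Note: run sync first and let it finish before running echo 3 > /proc/sys/vm/drop_caches. drop_caches only frees clean pages and never discards dirty data, so it is the preceding sync that writes dirty pages back and lets them be released as well.)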

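5. At 07:40 the next morning the phone rang and disaster had struck: all business processing had stopped. My judgment was that memory exhaustion had killed DB instance 2, and a WTF slipped out.
I opened my laptop and found that even Zabbix was dead.
6. Node 2 could no longer be logged into, so I logged into node 1 and checked the cluster, which, surprisingly, was still healthy.

7. I checked the listener and it was still OK too, which prompted another WTF (in fact instance 2 was only on its last legs and not yet completely dead).

8. Fortunately this is an Oracle RAC cluster, so as fast as my fingers would go I switched every application on node 2 over to node 1 and immediately rebooted server 2.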
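The numbers that free -h printed in step 3 come from /proc/meminfo. The following is a minimal sketch, not the monitoring system's actual formula, of how buff/cache and available memory can be expressed as a share of total memory. It assumes utilization is measured as 1 - MemAvailable/MemTotal; the sample values are invented for illustration and are not taken from the failed node.

```python
# Hypothetical /proc/meminfo excerpt; the numbers are invented for illustration.
SAMPLE = """\
MemTotal:       131072000 kB
MemFree:          1310720 kB
MemAvailable:     6553600 kB
Buffers:          1048576 kB
Cached:          78643200 kB
SReclaimable:     2097152 kB
"""


def parse_meminfo(text):
    """Turn /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def summarize(mem):
    """Return buff/cache, available and used memory as percentages of MemTotal."""
    total = mem["MemTotal"]
    # free(1) from procps-ng reports buff/cache as Buffers + Cached + SReclaimable.
    buff_cache = mem["Buffers"] + mem["Cached"] + mem.get("SReclaimable", 0)
    available = mem["MemAvailable"]
    return {
        "buff_cache_pct": round(100 * buff_cache / total, 1),
        "available_pct": round(100 * available / total, 1),
        # Assumption: "used" means everything that is not available.
        "used_pct": round(100 * (total - available) / total, 1),
    }


print(summarize(parse_meminfo(SAMPLE)))
```

On a live Linux host the same functions can be fed the contents of /proc/meminfo instead of SAMPLE.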

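II. Troubleshooting

With all business temporarily restored, there was finally time for a careful investigation.
1. Since memory had been exhausted, the first thing that came to mind was the Linux system log, so I logged in as root:
cd /var/log
more /var/log/messages

tail -n 500 -f /var/log/messages

Overwhelming: there were far too many log lines, so grep took the stage:
grep -i error /var/log/messages | more

Every error found dated from the period when server 2 was being shut down; the server logged nothing abnormal while the problem was actually occurring. Back empty-handed... WTF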
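One way to separate shutdown noise from the incident itself is to filter the log by time window as well as by keyword. The sketch below assumes the traditional syslog timestamp at the start of each line (for example "Sep  2 23:00:05") and takes the year as a parameter because that format omits it. The sample lines, host name and keyword list are hypothetical, not copied from the real /var/log/messages.

```python
import re
from datetime import datetime

# Illustrative keyword list; adjust to whatever is worth searching for.
KEYWORDS = re.compile(r"error|out of memory|oom", re.IGNORECASE)

# Hypothetical log lines, invented for this example.
SAMPLE_LINES = [
    "Sep  2 23:00:05 node2 kernel: example message reporting an error",
    "Sep  3 07:45:10 node2 systemd: example error while stopping a service",
    "Sep  2 22:10:00 node2 crond: routine job finished",
    "Sep  3 01:00:00 node2 kernel: routine message",
]


def lines_in_window(lines, start, end, year):
    """Return lines stamped within [start, end] that match KEYWORDS."""
    hits = []
    for line in lines:
        try:
            # The first 15 characters hold "Mon DD HH:MM:SS".
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if start <= stamp <= end and KEYWORDS.search(line):
            hits.append(line)
    return hits


window_start = datetime(2024, 9, 2, 23, 0)
window_end = datetime(2024, 9, 3, 7, 30)
for hit in lines_in_window(SAMPLE_LINES, window_start, window_end, 2024):
    print(hit)
```

An empty result for the incident window would match what was observed here: the operating system log stayed quiet while the instance was failing.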

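2. Since the server itself showed nothing abnormal, the DB was next, so I logged in as the grid/oracle users.
To save space, only the commands used are listed for the non-critical steps:
Check the backups: sqlplus / as sysdba

Next, look at the Oracle DB alert log:
cd /oracle/app/oracle/diag/rdbms/mcsdb/mcsdb2/trace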

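The log showed errors being reported on and off from 02:00 in the early morning of 09-02;

from 23:00 on 09-02 the message was OS failure message: No buffer space available;

from 23:00 on 09-02 to 07:00 on 09-03 the instance was on its last legs,
and by 07:30 on 09-03 it was essentially dead,

until server 2 was rebooted and instance 2 started again automatically.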
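When an alert log is long, counting the ORA- codes it contains is a quick way to see which error dominates. This is only a sketch: the sample below repeats the one message quoted in this article plus a made-up non-error line, and it is not an excerpt of the real alert log.

```python
import re
from collections import Counter

ORA_CODE = re.compile(r"\bORA-\d{5}\b")

# Illustrative input: the ORA-27301 text is the message quoted in this article,
# repeated to stand in for recurring errors; the last line is made up.
SAMPLE_LINES = [
    "ORA-27301: OS failure message: No buffer space available",
    "ORA-27301: OS failure message: No buffer space available",
    "ORA-27301: OS failure message: No buffer space available",
    "hypothetical informational line without an error code",
]


def count_ora_codes(lines):
    """Count how often each ORA-nnnnn code appears in the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(ORA_CODE.findall(line))
    return counts


print(count_ora_codes(SAMPLE_LINES).most_common(5))
```

In practice the lines would come from the alert log file in the trace directory rather than from a hard-coded list.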
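Putting the server's memory exhaustion together with the Oracle DB errors, I finally pinned down the following error as what brought the DB down, and found the fix in Oracle's official documentation:
ORA-27301: OS failure message: No buffer space available
Oracle Linux: ORA-27301:OS Failure Message: No Buffer Space Available (Doc ID 2041723.1)
This happens because too little space is available to reserve for network buffers.
[Screenshot: the explanation given in the Oracle document]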
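Next, adjust the relevant parameters.
First, change the MTU (as root):
1.vi /etc/sysconfig/network-scripts/ifcfg-lo
2.MTU=16436
3.systemctl restart network
Second, change the system parameter:
Set vm.min_free_kbytes to 0.4% of the server's total physical memory.
For example, this machine has 128 GB of memory, so the value is 128 GB × 0.4% ≈ 512 MB, entered as 512000 because the parameter is expressed in kB;

vi /etc/sysctl.conf
vm.min_free_kbytes = 512000
sysctl -p (as root)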
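The 0.4% rule is simple arithmetic, but the result depends on whether "128G" is read in decimal or binary units. The sketch below uses integer arithmetic to show both readings; the function name is made up for this example, and on a real host the input would be the MemTotal value from /proc/meminfo, which is already in kB.

```python
def min_free_kbytes(mem_total_kb, per_mille=4):
    """Return 0.4% (4 per mille) of total memory in kB, using integer arithmetic."""
    return mem_total_kb * per_mille // 1000


# 128 GB read as decimal units: 128,000,000 kB, which gives the 512000 used above.
print(f"vm.min_free_kbytes = {min_free_kbytes(128_000_000)}")

# 128 GiB read as binary units: 128 * 1024 * 1024 kB, a slightly larger result.
print(f"vm.min_free_kbytes = {min_free_kbytes(128 * 1024 * 1024)}")
```

Either value is in the same ballpark; the point is that the setting scales with the memory actually installed rather than being a fixed constant.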

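III. So is the problem now completely solved?

No.
In my view, at least two questions still have to be investigated and fully resolved before this issue can be considered closed:
1. Normally a server's memory usage fluctuates within a reasonable range. Why did it keep growing on this machine?
2. Linux's buff/cache memory exists mainly to improve system performance, for example by caching frequently queried hot data, and it is released automatically when memory gets tight or when a large file or a process requests memory. So why, on this machine, was it not released but instead kept growing until the available memory was used up?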

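Enough talk, on with the investigation:

1. First, Zabbix showed that memory usage on this machine (node 2) normally stays below 50%, but from 2024-08-29 08:10 it started to grow noticeably.
Node 1 in the same cluster, checked over the same period, stayed normal throughout (its fluctuations are the normal ones caused by changes in business volume).

Moreover, for the past month or so, 70% of the application (AP) work has been handled on node 1.
So where is the problem?
Under normal conditions the AP load is roughly balanced between node 1 and node 2, with node 1 handling somewhat more transaction messages; however, one Report application is deployed only on node 2, and the user UI also normally connects to node 2.
So what exactly happened on node 2 at 08:10 on 08-29?
Logging in as the AP user,
I checked the backups and the related scheduled jobs:

there is a scheduled log-compression job that runs at 08:10 every day.

The picture seemed to be getting clearer.
So why is the buff/cache memory not released?
That has to be investigated starting from the Linux system itself: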
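The most critical Linux memory parameters live here: /etc/sysctl.conf
[Screenshot: contents of /etc/sysctl.conf]
Two very important parameters in it, dirty_ratio and dirty_background_ratio, were configured incorrectly. What these two parameters do:
**vm.dirty_background_ratio:** This parameter specifies the percentage of system memory (e.g. 5%) that dirty pages in the file system cache may reach before the background writeback processes (pdflush/flush/kdmflush) are triggered to flush some of the cached dirty pages to disk asynchronously;
**vm.dirty_ratio:** This parameter specifies the percentage of system memory (e.g. 10%) at which the system has no choice but to start handling dirty pages itself (by then there are already many dirty pages, and some must be flushed to disk to avoid data loss); while this happens, many application processes may block because the system has switched to handling file I/O.
Do not assume that the dirty_ratio condition can never be reached just because vm.dirty_background_ratio is always hit first. It is true that vm.dirty_background_ratio is reached first and triggers the flush processes to write back asynchronously, but applications can keep writing during that time. If the applications together write more than the flush processes can write out, the threshold set by vm.dirty_ratio will be reached, at which point the operating system switches to handling dirty pages synchronously and blocks the application processes.
Take the failed node 2 in this article: vm.dirty_ratio was set to 80 and the machine has 125GB of usable physical memory (128GB nominal), which means dirty data in buff/cache can grow to roughly 125*80%=100GB before the Linux kernel forces it to be written back to disk. Meanwhile the DB's SGA is 40GB, so 40GB is already taken. It would be a miracle if memory did not collapse.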
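The arithmetic behind that 100GB figure, and the reason a high vm.dirty_ratio postpones throttling for so long, can be sketched as follows. This is a simplified model: it applies the ratios to total memory as the text does (the kernel actually applies them to free plus reclaimable memory), and the background ratio of 10 and the write and flush rates are hypothetical values chosen only to make the example concrete.

```python
def dirty_thresholds_gb(mem_gb, background_ratio, dirty_ratio):
    """Return (background, blocking) dirty-data thresholds in GB.

    Simplification: ratios are applied to total memory, not to the kernel's
    notion of available memory.
    """
    return mem_gb * background_ratio / 100, mem_gb * dirty_ratio / 100


def seconds_until_throttle(mem_gb, dirty_ratio, write_gb_s, flush_gb_s, start_gb=0.0):
    """Seconds until dirty data reaches the dirty_ratio limit, or None if it never does."""
    net_rate = write_gb_s - flush_gb_s
    if net_rate <= 0:
        return None  # the flusher keeps up, so the limit is never reached
    limit_gb = mem_gb * dirty_ratio / 100
    return max(limit_gb - start_gb, 0.0) / net_rate


# 125 GB of memory, a hypothetical background ratio of 10, and dirty_ratio = 80.
print(dirty_thresholds_gb(125, 10, 80))

# Hypothetical rates: applications write 0.5 GB/s, the flusher drains 0.25 GB/s.
print(seconds_until_throttle(125, 80, 0.5, 0.25))
print(seconds_until_throttle(125, 10, 0.5, 0.25))
```

With the same hypothetical rates, lowering dirty_ratio from 80 to 10 brings the point at which writers are throttled much closer, so far less unwritten data can pile up in memory.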

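IV. So the root cause seems to have been found. Since these two parameters matter so much, what should they be set to?

1. First, let's see what a certain AI has to say:
For Oracle 12, there is likewise no absolute optimum for vm.dirty_background_ratio and vm.dirty_ratio, because it depends on many factors such as the system's hardware configuration, workload characteristics and storage performance.
Generally speaking, the commonly suggested values are as follows:
vm.dirty_background_ratio is usually set at around 5% to 10%. This parameter controls the point at which, once dirty pages reach this share of system memory, the kernel starts background processes that write dirty pages back to disk asynchronously, so that an excessive share of dirty pages does not hurt system performance.
vm.dirty_ratio can be set between 20% and 40%. When the share of dirty pages reaches this value, the system starts writing dirty pages back to disk synchronously, which can have a considerable impact on system performance.
However, the actual optimal settings have to be determined through performance testing and monitoring. You can start from these suggested values, watch indicators such as disk I/O and database performance, and adjust the parameters step by step to find what suits your environment best.
Note that kernel parameters should be adjusted with care: back up important data and configuration files before making any major change, and consult Oracle's official documentation and best practices for more specific advice.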

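Below is how Oracle's official documentation describes these two parameters:
[Screenshot: Oracle documentation on vm.dirty_background_ratio and vm.dirty_ratio]
I also quote a post from an overseas blog here for reference: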
Better Linux Disk Caching & Performance with vm.dirty_ratio & vm.dirty_background_ratio
by BOB PLANKERS on DECEMBER 22, 2013
This is post #16 in my December 2013 series about Linux Virtual Machine Performance Tuning. For more, please see the tag “Linux VM Performance Tuning.”

In previous posts on vm.swappiness and using RAM disks we talked about how the memory on a Linux guest is used for the OS itself (the kernel, buffers, etc.), applications, and also for file cache. File caching is an important performance improvement, and read caching is a clear win in most cases, balanced against applications using the RAM directly. Write caching is trickier. The Linux kernel stages disk writes into cache, and over time asynchronously flushes them to disk. This has a nice effect of speeding disk I/O but it is risky. When data isn’t written to disk there is an increased chance of losing it.

There is also the chance that a lot of I/O will overwhelm the cache, too. Ever written a lot of data to disk all at once, and seen large pauses on the system while it tries to deal with all that data? Those pauses are a result of the cache deciding that there’s too much data to be written asynchronously (as a non-blocking background operation, letting the application process continue), and switches to writing synchronously (blocking and making the process wait until the I/O is committed to disk). Of course, a filesystem also has to preserve write order, so when it starts writing synchronously it first has to destage the cache. Hence the long pause.

The nice thing is that these are controllable options, and based on your workloads & data you can decide how you want to set them up. Let’s take a look:

$ sysctl -a | grep dirty
vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
vm.dirty_background_ratio is the percentage of system memory that can be filled with “dirty” pages — memory pages that still need to be written to disk — before the pdflush/flush/kdmflush background processes kick in to write it to disk. My example is 10%, so if my virtual server has 32 GB of memory that’s 3.2 GB of data that can be sitting in RAM before something is done.

vm.dirty_ratio is the absolute maximum amount of system memory that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point all new I/O blocks until dirty pages have been written to disk. This is often the source of long I/O pauses, but is a safeguard against too much data being cached unsafely in memory.

vm.dirty_background_bytes and vm.dirty_bytes are another way to specify these parameters. If you set the _bytes version the _ratio version will become 0, and vice-versa.

vm.dirty_expire_centisecs is how long something can be in cache before it needs to be written. In this case it’s 30 seconds. When the pdflush/flush/kdmflush processes kick in they will check to see how old a dirty page is, and if it’s older than this value it’ll be written asynchronously to disk. Since holding a dirty page in memory is unsafe this is also a safeguard against data loss.

vm.dirty_writeback_centisecs is how often the pdflush/flush/kdmflush processes wake up and check to see if work needs to be done.

You can also see statistics on the page cache in /proc/vmstat:

$ cat /proc/vmstat | egrep "dirty|writeback"
nr_dirty 878
nr_writeback 0
nr_writeback_temp 0
In my case I have 878 dirty pages waiting to be written to disk.

Approach 1: Decreasing the Cache
As with most things in the computer world, how you adjust these depends on what you’re trying to do. In many cases we have fast disk subsystems with their own big, battery-backed NVRAM caches, so keeping things in the OS page cache is risky. Let’s try to send I/O to the array in a more timely fashion and reduce the chance our local OS will, to borrow a phrase from the service industry, be “in the weeds.” To do this we lower vm.dirty_background_ratio and vm.dirty_ratio by adding new numbers to /etc/sysctl.conf and reloading with “sysctl -p”:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
This is a typical approach on virtual machines, as well as Linux-based hypervisors. I wouldn’t suggest setting these parameters to zero, as some background I/O is nice to decouple application performance from short periods of higher latency on your disk array & SAN (“spikes”).

Approach 2: Increasing the Cache
There are scenarios where raising the cache dramatically has positive effects on performance. These situations are where the data contained on a Linux guest isn’t critical and can be lost, and usually where an application is writing to the same files repeatedly or in repeatable bursts. In theory, by allowing more dirty pages to exist in memory you’ll rewrite the same blocks over and over in cache, and just need to do one write every so often to the actual disk. To do this we raise the parameters:

vm.dirty_background_ratio = 50
vm.dirty_ratio = 80
Sometimes folks also increase the vm.dirty_expire_centisecs parameter to allow more time in cache. Beyond the increased risk of data loss, you also run the risk of long I/O pauses if that cache gets full and needs to destage, because on large VMs there will be a lot of data in cache.

Approach 3: Both Ways
There are also scenarios where a system has to deal with infrequent, bursty traffic to slow disk (batch jobs at the top of the hour, midnight, writing to an SD card on a Raspberry Pi, etc.). In that case an approach might be to allow all that write I/O to be deposited in the cache so that the background flush operations can deal with it asynchronously over time:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 80
Here the background processes will start writing right away when it hits that 5% ceiling but the system won’t force synchronous I/O until it gets to 80% full. From there you just size your system RAM and vm.dirty_ratio to be able to consume all the written data. Again, there are tradeoffs with data consistency on disk, which translates into risk to data. Buy a UPS and make sure you can destage cache before the UPS runs out of power. 😃

No matter the route you choose you should always be gathering hard data to support your changes and help you determine if you are improving things or making them worse. In this case you can get data from many different places, including the application itself, /proc/vmstat, /proc/meminfo, iostat, vmstat, and many of the things in /proc/sys/vm. Good luck!
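The counters that the quoted post reads from /proc/vmstat are expressed in pages, so converting them to bytes makes them easier to compare with the thresholds. A minimal sketch, assuming a 4096-byte page size (typical on x86-64, but an assumption here) and using the three counter values shown in the quoted post as sample input:

```python
# Counter values as shown in the quoted post.
SAMPLE = """\
nr_dirty 878
nr_writeback 0
nr_writeback_temp 0
"""


def dirty_counters(vmstat_text, page_size=4096):
    """Return nr_dirty and nr_writeback from /proc/vmstat-style text, in bytes."""
    wanted = {"nr_dirty", "nr_writeback"}
    result = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted:
            # Counters are page counts; multiply by the assumed page size.
            result[parts[0]] = int(parts[1]) * page_size
    return result


print(dirty_counters(SAMPLE))
```

Sampling these values at intervals on a live system shows whether dirty data is draining or accumulating.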

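To sum up, the usual ranges for these two parameters are as follows (the background threshold must stay below the blocking threshold):
dirty_background_ratio: generally 1-10%;
dirty_ratio: 10-40%;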
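A quick sanity check on a proposed pair of values can catch the kind of misconfiguration found on node 2. The sketch below assumes that the background threshold should sit below the blocking threshold, and treats 10% and 40% as illustrative upper limits rather than hard rules; the function name and messages are made up for this example.

```python
def check_dirty_ratios(background_ratio, dirty_ratio, max_background=10, max_dirty=40):
    """Return a list of warnings for a dirty_background_ratio / dirty_ratio pair."""
    problems = []
    if background_ratio >= dirty_ratio:
        problems.append("vm.dirty_background_ratio should be lower than vm.dirty_ratio")
    if background_ratio > max_background:
        problems.append(f"vm.dirty_background_ratio={background_ratio} is above {max_background}")
    if dirty_ratio > max_dirty:
        problems.append(f"vm.dirty_ratio={dirty_ratio} is above {max_dirty}")
    return problems


# dirty_ratio = 80 as found on node 2; the background value of 5 is hypothetical.
print(check_dirty_ratios(5, 80))
print(check_dirty_ratios(5, 20))
```

The limits are parameters so that a site can substitute whatever its own testing supports.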

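V. Other key memory-related Linux parameters (/etc/sysctl.conf):

vm.vfs_cache_pressure (default = 100)
At the default vfs_cache_pressure=100, the kernel reclaims dentries and inodes at a "relatively fair" rate compared with page cache and swap cache reclaim.
This parameter controls how strongly the kernel tends to reclaim the memory holding directory entry (dentry) caches and inode objects.
Decreasing vfs_cache_pressure makes the kernel prefer to keep dentry and inode caches; increasing vfs_cache_pressure above 100 makes the kernel prefer to free them.
Raising vfs_cache_pressure therefore makes the kernel more inclined to free these caches, which limits the size of the page cache.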
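vm.dirty_background_ratio (default = 10)
The value is the share of total memory taken up by dirty pages; when the dirty pages in the system reach this level, the pdflush kernel thread starts writing dirty data to storage.
Lowering it makes pdflush write dirty pages to storage earlier, which limits the size of the page cache.
vm.dirty_ratio (default = 20)
This parameter specifies the percentage of system memory (20% by default) at which the system has no choice but to start handling dirty pages itself (by then there are already many dirty pages, and some must be flushed to external storage to avoid data loss); while this happens, many application processes may block because the system has switched to handling file I/O.
Lowering it makes the system deal with dirty pages in memory earlier, which limits the size of the page cache.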
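vm.dirty_writeback_centisecs (Red Hat Enterprise Linux 4 & 5: default = 499, Red Hat Enterprise Linux 6 and 7: default = 500)
The pdflush process is woken up periodically to write dirty page data to disk. The unit is 1/100 of a second. The default is 500, meaning pdflush is woken up every 5 seconds.
Lowering it wakes pdflush more often to handle dirty pages, which limits the size of the page cache.
vm.dirty_expire_centisecs (Red Hat Enterprise Linux 4 and 5: default = 2999, Red Hat Enterprise Linux 6 and 7: default = 3000)
This parameter states how "old" data in the kernel's write buffer has to be before the woken pdflush process considers writing it to disk. The unit is 1/100 of a second. The default is 3000, meaning data that is 30 seconds old counts as old and is written to disk once pdflush wakes up.
Lowering it means dirty pages become "old" sooner and are written to disk by pdflush, which limits the size of the page cache.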
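vm.swappiness (RHEL 5 and 6: default = 60, RHEL 7: default = 30)
This parameter controls how strongly the kernel tends to swap inactive memory pages out to the swap partition (the higher the value, the more likely inactive pages are to be swapped out).
Lowering it makes the kernel prefer to keep inactive memory pages in physical memory and to free pages from the page cache instead, which limits the size of the page cache.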
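To see at a glance which of these parameters have been changed on a host, the configured values can be compared with the defaults listed in this section. The sketch below parses sysctl.conf-style text; the DEFAULTS table copies the RHEL 6/7 figures from the list above (with 60 for vm.swappiness, the RHEL 5/6 figure), and the sample fragment is hypothetical apart from the dirty_ratio and min_free_kbytes values mentioned in this article.

```python
# Defaults as listed in this section; vm.swappiness uses the RHEL 5/6 figure.
DEFAULTS = {
    "vm.vfs_cache_pressure": 100,
    "vm.dirty_background_ratio": 10,
    "vm.dirty_ratio": 20,
    "vm.dirty_writeback_centisecs": 500,
    "vm.dirty_expire_centisecs": 3000,
    "vm.swappiness": 60,
}

# Hypothetical sysctl.conf fragment for illustration.
SAMPLE = """\
# memory tuning
vm.dirty_background_ratio = 10
vm.dirty_ratio = 80
vm.min_free_kbytes = 512000
"""


def parse_sysctl(text):
    """Parse 'key = value' lines with integer values, ignoring comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition("=")
        if not sep:
            continue
        try:
            settings[key.strip()] = int(value.strip())
        except ValueError:
            continue  # skip non-integer values
    return settings


def non_default(settings):
    """Return {name: (configured, default)} for parameters that differ from DEFAULTS."""
    return {
        name: (value, DEFAULTS[name])
        for name, value in settings.items()
        if name in DEFAULTS and value != DEFAULTS[name]
    }


print(non_default(parse_sysctl(SAMPLE)))
```

Parameters that are not in the DEFAULTS table, such as vm.min_free_kbytes, are simply ignored by the comparison.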