Notes on a Memory Stomping Incident

Original article, last modified 2025-01-03 09:04:33


License: CC 4.0 BY-SA

Article tags:

#linux #operations #server

First published 2024-10-22 20:36:45

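Introduction

Memory stomping is a kind of bug whose symptoms are fairly random and obscure, so most of the time it is hard to reproduce and track down. This time we were lucky: the corruption landed in the static data area at a fairly fixed location, which is why it could be traced forward and resolved without much trouble.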

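How it started

After a third-party GPS parsing library was ported in, the existing C++ program began exiting abnormally with a segmentation fault (SIGSEGV). Further digging pointed to memory stomping as the likely cause.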

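1. Generating a core dump

The program runs on a Linux device. To generate a core dump file, both the ulimit setting and the core dump path need to be configured.

Changing the ulimit setting

Append the following to the end of /etc/profile: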

ulimit -c unlimited

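This removes the size limit on core files; otherwise a larger program cannot produce a complete core file.

Changing where core dumps are stored

Append the following to the end of /etc/sysctl.conf: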

kernel.core_pattern = /opt/xxx/log/%e-%t.core

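This sets the path and file name template for core files (%e expands to the executable name, %t to the dump time in seconds since the epoch).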
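The two settings above can also be checked from inside the process. The following is a minimal sketch, assuming a Linux target with /proc mounted; raise_core_limit and read_core_pattern are hypothetical helper names, not part of the original program, and raising the soft limit to the hard limit is a swapped-in alternative to editing /etc/profile, not the method used in this article.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_CORE (POSIX)
#include <cassert>         // assert, for the usage check mentioned below
#include <fstream>
#include <string>

// Hypothetical helper: raise the soft core-size limit to the hard limit.
// Returns true if, afterwards, the soft limit equals the hard limit.
bool raise_core_limit() {
    struct rlimit lim{};
    if (getrlimit(RLIMIT_CORE, &lim) != 0) return false;
    lim.rlim_cur = lim.rlim_max;  // an unprivileged process may go up to the hard limit
    if (setrlimit(RLIMIT_CORE, &lim) != 0) return false;

    struct rlimit check{};
    if (getrlimit(RLIMIT_CORE, &check) != 0) return false;
    return check.rlim_cur == check.rlim_max;
}

// Hypothetical helper: read the active core pattern from procfs.
// Returns an empty string if the file cannot be read.
std::string read_core_pattern() {
    std::ifstream in("/proc/sys/kernel/core_pattern");
    std::string pattern;
    std::getline(in, pattern);
    return pattern;
}
```

A program could call assert(raise_core_limit()); early in main and log read_core_pattern() to confirm where a core file would be written. If the hard limit itself is 0, the soft limit stays at 0 and no core file is produced, so the ulimit setting above is still needed in that case.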

Reproducing the problem

Reboot the system and run the faulty code again to reproduce the segmentation fault:

Segmentation fault (core dumped)

Analyzing the core file

gdb can be used to analyze the core file:

gdb <faulty program> <core file of the faulty program>

root@xxx-2-15:/opt/xxx/log# /gdb /xxxapp xxxapp-1594990464.core

#0  0xb6aa594c in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy_chars(char*, char*, char*)
    () from /usr/lib/libstdc++.so.6
#1  0x00055f90 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*> (
    this=0xa21feb78, __beg=0x12d5aa0 <error: Cannot access memory at address 0x12d5aa0>, 
    __end=0x12d5aa1 <error: Cannot access memory at address 0x12d5aa1>)
    at /home/dchen/petalinux/tools/linux-i386/gcc-arm-linux-gnueabi/arm-linux-gnueabihf/include/c++/6.2.1/bits/basic_string.tcc:225
#2  0xb6aa80b4 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/libstdc++.so.6
#3  0x0004b0b0 in std::operator+<char, std::char_traits<char>, std::allocator<char> > (__lhs=..., __rhs=0x239b0c "/")
    at /home/dchen/petalinux/tools/linux-i386/gcc-arm-linux-gnueabi/arm-linux-gnueabihf/include/c++/6.2.1/bits/basic_string.h:4969
#4  0x000e36f4 in SystemUpdate::getLastInfo (lastVersion=..., lastVersionId=@0x2f6930: 0)
    at /home/dchen/chendu_repo/apd-unifyrepo/packages/application/apdapp/src/function/SystemUpdate.cpp:65

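Locating where the segmentation fault occurs

The call chain shows that the fault occurs at SystemUpdate.cpp:65. Inspection showed that this code accesses a global std::string object. Debugging with gdb breakpoints revealed that, before the fault occurred, the object's internal data pointer (_M_dataplus._M_p in libstdc++) had already been tampered with, while the string contents were still in memory. A code walkthrough showed that the object is a constant and nothing in our own code writes to it illegitimately, so the problem was tentatively diagnosed as memory stomping.

Finding the culprit of the stomping

Since we had pinned down a victim that gets stomped every single time, setting a watch on the victim object should lead straight to the culprit.

First attempt:

a. Restart gdb and run: watch xxxstring

b. run

The program hung and never finished starting. After some research, the reason became clear: watch relies on CPU hardware support, and the number of bytes the hardware can watch is limited. The string object is larger than that limit, so it cannot be watched in hardware. (In that situation gdb typically falls back to a software watchpoint, which single-steps the program and is slow enough to look like a hang.)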
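To get a feel for why watching the whole object is a problem, it can help to count how many fixed-width watch regions an object would need. The sketch below assumes a 4-byte watch width purely for illustration; words_needed is a hypothetical helper, and the real limits (region width and number of debug registers) depend on the CPU.

```cpp
#include <cassert>  // assert, for the usage checks mentioned below
#include <cstddef>
#include <string>

// Hypothetical helper: number of watch regions of `watch_width` bytes
// needed to cover an object of `object_size` bytes (rounded up).
constexpr std::size_t words_needed(std::size_t object_size,
                                   std::size_t watch_width = 4) {
    return (object_size + watch_width - 1) / watch_width;
}

// Number of 4-byte regions a std::string object occupies on this platform.
// The exact value depends on the standard library and the pointer size.
constexpr std::size_t kStringRegions = words_needed(sizeof(std::string));
```

For example, a 24-byte object would need words_needed(24) == 6 regions of 4 bytes each, which is more than many CPUs provide as hardware watchpoint slots. Watching a single 4-byte field inside the object needs only one.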

Improved approach:

Take the address of xxxstring:

p &xxxstring

This yields the address: 0xabcdef

watch *0xabcdef

With this form, gdb by default watches the address as a 4-byte variable.

Then execute: run

Thread 2 "apdapp" hit Hardware watchpoint 2: SystemUpdate::filePath._M_string_length

Old value = 11
New value = 1
nmea_parser_gsa (tokenizer=0xb6873240, navdata=0x2d5700 <s_nav>)
at /home/dchen/chendu_repo/apd-unifyrepo/packages/application/apdapp/src/function/BDSparse.cpp:374

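The program stops every time a write to the watched address is detected; in the output above, the stop is inside nmea_parser_gsa at BDSparse.cpp:374. In the end, the culprit was confirmed to be an out-of-bounds array write in the third-party library.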
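The usual remedy for this class of bug is to check the index before every write into a fixed-size array. The following is only a sketch of that idea, not the third-party library's actual code: NavData, kMaxSlots, and store_sat_id are hypothetical names, and the capacity of 12 is an assumption chosen for illustration.

```cpp
#include <array>
#include <cassert>  // assert, for the usage checks mentioned below
#include <cstddef>

constexpr std::size_t kMaxSlots = 12;  // hypothetical capacity, an assumption

// Hypothetical stand-in for a statically allocated navigation data block.
struct NavData {
    std::array<int, kMaxSlots> sat_ids{};  // zero-initialized slots
    std::size_t count = 0;                 // number of slots in use
};

// Store `id` at `index` only if the index is inside the array.
// Returns false and leaves `nav` untouched when the index is out of range,
// so a malformed sentence cannot overwrite neighboring static data.
bool store_sat_id(NavData& nav, std::size_t index, int id) {
    if (index >= nav.sat_ids.size()) {
        return false;  // reject the out-of-bounds write
    }
    nav.sat_ids[index] = id;
    if (index + 1 > nav.count) {
        nav.count = index + 1;
    }
    return true;
}
```

With this guard, store_sat_id(nav, 12, 7) returns false instead of writing one element past the end of the array, which is the kind of write that would otherwise land on whatever object the linker placed next in the static data area.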