mysql pointer_MySQL-Innodb-Hazard Pointer的设计与实现

最新推荐文章于 2025-03-06 12:40:57 发布

原创最新推荐文章于 2025-03-06 12:40:57 发布 · 179 阅读

0 ·

CC 4.0 BY-SA版权

文章标签：

#mysql pointer

本文详细介绍了InnoDB存储引擎在处理刷脏页过程中的性能问题。早期版本的算法可能导致在磁盘速度较慢时，刷脏操作退化为O(N*N)复杂度。为了解决这个问题，引入了Hazard Pointer技术。Hazard Pointer允许在释放锁之前保留对下一个待处理节点的引用，从而在释放锁后仍能快速定位到有效节点，避免重复扫描，保证刷脏操作始终维持在O(N)复杂度，提高了系统的效率。这一优化同样适用于LRU List驱逐算法，提升了整体系统性能。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

前提

flush_list是包含dirty page并按修改时间有序的链表，在刷脏时选择从链表的尾部进行遍历淘汰

异步IO，当异步写page完成后，io helper线程会调buf_flush_write_complete，将写入的Page从flush list上移除。

调用栈

----fil_aio_wait

----buf_page_io_complete(static_cast(message));

----buf_page_io_complete

--------buf_flush_write_complete

-------------buf_flush_remove

-----------------UT_LIST_REMOVE(buf_pool->flush_list, bpage);

以前的代码实现(版本5.7.1):

刷脏线程刷完一个数据页，就需要回到Flush List末尾，重新扫描新的可刷盘的脏页。

buf_do_flush_list_batch{

do {

/* Start from the end of the list looking for a suitable

block to be flushed. */

buf_flush_list_mutex_enter(buf_pool);

/* We use len here because theoretically insertions can

happen in the flush_list below while we are traversing

it for a suitable candidate for flushing. We'd like to

set a limit on how farther we are willing to traverse

the list. */

len = UT_LIST_GET_LEN(buf_pool->flush_list);

bpage = UT_LIST_GET_LAST(buf_pool->flush_list);

if (bpage) {

ut_a(bpage->oldest_modification > 0);

}

if (!bpage || bpage->oldest_modification >= lsn_limit) {

/* We have flushed enough */

buf_flush_list_mutex_exit(buf_pool);

break;

}

ut_a(bpage->oldest_modification > 0);

ut_ad(bpage->in_flush_list);

buf_flush_list_mutex_exit(buf_pool);

/* The list may change during the flushing and we cannot

safely preserve within this function a pointer to a

block in the list! */

while (bpage != NULL

&& len > 0

&& !buf_flush_page_and_try_neighbors(

bpage, BUF_FLUSH_LIST, min_n, &count)) {

++scanned;

buf_flush_list_mutex_enter(buf_pool);

/* If we are here that means that buf_pool->mutex

was not released in buf_flush_page_and_try_neighbors()

above and this guarantees that bpage didn't get

relocated since we released the flush_list

mutex above. There is a chance, however, that

the bpage got removed from flush_list (not

currently possible because flush_list_remove()

also obtains buf_pool mutex but that may change

in future). To avoid this scenario we check

the oldest_modification and if it is zero

we start all over again. */

if (bpage->oldest_modification == 0) {

buf_flush_list_mutex_exit(buf_pool);

break;

}

bpage = UT_LIST_GET_PREV(list, bpage);

ut_ad(!bpage || bpage->in_flush_list);

buf_flush_list_mutex_exit(buf_pool);

--len;

}

} while (count < min_n && bpage != NULL && len > 0);

}

为什么这样设计？

因为刷脏线程刷的这个脏页之前的脏页可能被其他线程给刷走了，之前的脏页可能已经不在Flush list中。

造成这个现象的原因：

用户线程也会进行刷脏，虽然是从lru队列中找到的脏页，当脏页被flush到磁盘后，用户线程也会将写入的Page从flush list上移除。具体的调用栈参见：

https://www.jianshu.com/p/080056afa842

同时，为了减少锁占用的时间，InnoDB在进行写盘的时候都会把之前占用的锁给释放掉。

带来了什么后果

我们的某一个刷脏线程拿到队尾最后一个数据页，IO fixed，发送给IO线程，最后再从队尾扫描寻找可刷盘的脏页。在这次扫描中，它发现最后一个数据页(也就是刚刚发送到IO线程中的数据页)状态为IO fixed(磁盘很慢，还没处理完)所以不能刷，跳过，开始刷倒数第二个数据页，同样IO fixed，发送给IO线程，然后再次重新扫描Flush List。它又发现尾部的两个数据页都不能刷新(因为磁盘很慢，可能还没刷完)，直到扫描到倒数第三个数据页。所以，存在一种极端的情况，如果磁盘比较缓慢，刷脏算法性能会从O(N)退化成O(N*N)。

https://dev.mysql.com/worklog/task/?id=7047

Hazard Pointer版本的代码实现(5.7.28)

static

ulint

buf_do_flush_list_batch(

buf_pool_t* buf_pool,

ulint min_n,

lsn_t lsn_limit)

{

ulint count = 0;

ulint scanned = 0;

ut_ad(buf_pool_mutex_own(buf_pool));

/* Start from the end of the list looking for a suitable

block to be flushed. */

buf_flush_list_mutex_enter(buf_pool);

ulint len = UT_LIST_GET_LEN(buf_pool->flush_list);

/* In order not to degenerate this scan to O(n*n) we attempt

to preserve pointer of previous block in the flush list. To do

so we declare it a hazard pointer. Any thread working on the

flush list must check the hazard pointer and if it is removing

the same block then it must reset it. */

for (buf_page_t* bpage = UT_LIST_GET_LAST(buf_pool->flush_list);

count < min_n && bpage != NULL && len > 0

&& bpage->oldest_modification < lsn_limit;

bpage = buf_pool->flush_hp.get(),

++scanned) {

buf_page_t* prev;

ut_a(bpage->oldest_modification > 0);

ut_ad(bpage->in_flush_list);

prev = UT_LIST_GET_PREV(list, bpage);

buf_pool->flush_hp.set(prev);

buf_flush_list_mutex_exit(buf_pool);

#ifdef UNIV_DEBUG

bool flushed =

#endif /* UNIV_DEBUG */

buf_flush_page_and_try_neighbors(

bpage, BUF_FLUSH_LIST, min_n, &count);

buf_flush_list_mutex_enter(buf_pool);

ut_ad(flushed || buf_pool->flush_hp.is_hp(prev));

--len;

}

buf_pool->flush_hp.set(NULL);

buf_flush_list_mutex_exit(buf_pool);

if (scanned) {

MONITOR_INC_VALUE_CUMULATIVE(

MONITOR_FLUSH_BATCH_SCANNED,

MONITOR_FLUSH_BATCH_SCANNED_NUM_CALL,

MONITOR_FLUSH_BATCH_SCANNED_PER_CALL,

scanned);

}

if (count) {

MONITOR_INC_VALUE_CUMULATIVE(

MONITOR_FLUSH_BATCH_TOTAL_PAGE,

MONITOR_FLUSH_BATCH_COUNT,

MONITOR_FLUSH_BATCH_PAGES,

count);

}

ut_ad(buf_pool_mutex_own(buf_pool));

return(count);

}

原理

要解决这个问题，最本质的方法就是当刷完一个脏页的时候不要每次都从队尾重新扫描。我们可以使用Hazard Pointer来解决，方法如下：遍历找到一个可刷盘的数据页，在锁释放之前，调整Hazard Pointer使之指向Flush List中下一个节点，注意一定要在持有锁的情况下修改。然后释放锁，进行刷盘，刷完盘后，重新获取锁，读取Hazard Pointer并设置下一个节点，然后释放锁，进行刷盘，如此重复。当这个线程在刷盘的时候，另外一个线程需要刷盘，也是通过Hazard Pointer来获取可靠的节点，并重置下一个有效的节点。通过这种机制，保证每次读到的Hazard Pointer是一个有效的Flush List节点，即使磁盘再慢，刷脏算法效率依然是O(N)。这个解法同样可以用到LRU List驱逐算法上，提高驱逐的效率。

http://mysql.taobao.org/monthly/2017/05/01/