[Tutorial] Monitoring GPU usage in the background, with automatic logging and plotting

小锋学长生活大爆炸

Published 2025-07-31 12:17:29

License: CC 4.0 BY-SA

Column: 学习之旅 (Learning Journey). Tags: python, numpy, artificial intelligence, GPU, nvidia-smi, memory monitoring

Article link: https://blog.csdn.net/sxf1061700625/article/details/149802722

When reposting, please credit the source: 小锋学长生活大爆炸 [xfxuezhagn.cn]

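Example output

Script overview

This script automatically monitors GPU usage on multi-GPU machines. Once a GPU has been active for a configured number of consecutive seconds, the script starts logging that card's usage. Once the GPU has then been idle for a configured number of consecutive seconds, logging stops and plots are generated. Each GPU is evaluated, logged, and stored independently, and every logging session gets its own timestamped directory, so earlier sessions are never overwritten and can be reviewed later.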

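How it works

The script runs through the following steps:

Initialize GPU information:
- Use pynvml to enumerate all available GPUs on the system;
- Create a separate GPUTracker instance for each GPU to track its state.
Poll roughly once per second:
- Read each GPU's memory usage (MB) and utilization (%);
- Classify the GPU as "active" or "idle" and update its counters.
Start condition:
- If a GPU has been active for ACTIVE_THRESHOLD consecutive seconds (active means utilization above MIN_UTIL_THRESHOLD, 1% by default):
  - Create a separate directory for that GPU, named after the current timestamp;
  - Start writing the log.
What is logged:
- One record per second: timestamp, memory used, total memory, GPU utilization.
Stop condition:
- If a monitored GPU has been idle for INACTIVE_THRESHOLD consecutive seconds (idle means utilization at or below MIN_UTIL_THRESHOLD, so 0% or 1% by default):
  - Stop logging;
  - Parse that GPU's log automatically;
  - Draw and save:
    - the memory usage curve (mem_plot.png);
    - the GPU utilization curve (util_plot.png).
Exit handling:
- Log files of GPUs that are still being monitored are closed safely, and their plots are generated;
- nvmlShutdown() is always called to release NVML resources.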
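
The start and stop rules above amount to a small two-counter state machine. The sketch below replays them on a list of made-up utilization readings, without touching NVML, so the trigger points are easy to see. The function name simulate and the sample data are illustrative assumptions, not part of the script.

```python
ACTIVE_THRESHOLD = 5     # consecutive active samples needed to start logging
INACTIVE_THRESHOLD = 10  # consecutive idle samples needed to stop logging
MIN_UTIL_THRESHOLD = 1   # utilization (%) must exceed this to count as active


def simulate(util_samples):
    """Return (event, sample_index) pairs for a sequence of utilization readings.

    Hypothetical helper: it mirrors the counter logic of the script but works
    on plain numbers instead of live NVML readings.
    """
    active = inactive = 0
    monitoring = False
    events = []
    for i, util in enumerate(util_samples):
        # An active sample resets the idle counter, and vice versa.
        if util > MIN_UTIL_THRESHOLD:
            active, inactive = active + 1, 0
        else:
            active, inactive = 0, inactive + 1
        if not monitoring and active >= ACTIVE_THRESHOLD:
            monitoring = True
            events.append(("start", i))
        if monitoring and inactive >= INACTIVE_THRESHOLD:
            monitoring = False
            events.append(("stop", i))
    return events


# 3 idle samples, 8 busy samples, then 12 idle samples (made-up data)
samples = [0] * 3 + [50] * 8 + [0] * 12
events = simulate(samples)
print(events)
```

One consequence of this design: in the script, a sample is written to the log before the start check runs, so the samples that trigger the start are not in the log, while the idle samples leading up to the stop are.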

Output directory layout

All output is saved under gpu_logs/, organized as follows:

gpu_logs/
├── gpu_0/
│ ├── 2025-07-31_15-42-10/
│ │ ├── log.txt ← raw log data
│ │ ├── mem_plot.png ← memory usage plot
│ │ └── util_plot.png ← utilization plot
│ └── ...
├── gpu_1/
│ └── ...
...
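
As a rough illustration of how these paths are put together, the sketch below builds the three file paths for one session from a GPU index and a start time. The helper name session_paths is an assumption made for this example; the script itself builds the same paths inside GPUTracker.

```python
import os
from datetime import datetime

LOG_DIR = "gpu_logs"  # same root directory as in the script


def session_paths(gpu_index, started_at):
    """Build the output paths for one logging session (hypothetical helper)."""
    stamp = started_at.strftime("%Y-%m-%d_%H-%M-%S")
    session_dir = os.path.join(LOG_DIR, f"gpu_{gpu_index}", stamp)
    return {
        "dir": session_dir,
        "log": os.path.join(session_dir, "log.txt"),
        "mem_plot": os.path.join(session_dir, "mem_plot.png"),
        "util_plot": os.path.join(session_dir, "util_plot.png"),
    }


paths = session_paths(0, datetime(2025, 7, 31, 15, 42, 10))
for name, path in paths.items():
    print(f"{name}: {path}")
```

Because the timestamp has one-second resolution and a session can only start after the previous one on the same GPU has ended, two sessions on one GPU should not end up in the same directory.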

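Configurable parameters

Set the following parameters at the top of the script:

Parameter	Default	Description
`MIN_UTIL_THRESHOLD`	`1`	The GPU counts as active only when utilization is above this value (%)
`MIN_MEMORY_THRESHOLD`	`100`	The GPU counts as active only when memory use is above this value (MB; not enabled by default, add the condition manually if needed)
`ACTIVE_THRESHOLD`	`5`	Number of consecutive active seconds that triggers the start of logging
`INACTIVE_THRESHOLD`	`10`	Number of consecutive idle seconds that triggers the end of logging
`LOG_DIR`	`"gpu_logs"`	Root directory for all logs and plots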
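
Since MIN_MEMORY_THRESHOLD is not wired in by default, here is one possible way to enable it, following the OR condition that the script leaves commented out. The use_memory flag is a hypothetical switch added only for this sketch.

```python
MIN_UTIL_THRESHOLD = 1      # percent
MIN_MEMORY_THRESHOLD = 100  # MB


def is_gpu_active(util_percent, mem_used_mb, use_memory=True):
    """Treat the GPU as active if utilization OR memory use is above its threshold.

    use_memory is a hypothetical flag for this sketch; with use_memory=False
    the function behaves like the utilization-only check in the script.
    """
    if util_percent > MIN_UTIL_THRESHOLD:
        return True
    return use_memory and mem_used_mb > MIN_MEMORY_THRESHOLD


print(is_gpu_active(0, 500))                    # memory alone keeps it active
print(is_gpu_active(0, 500, use_memory=False))  # utilization-only behavior
```

One trade-off to keep in mind: with the memory condition enabled, a process that holds memory without computing keeps the GPU "active", so a session only ends once that memory is released.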

Monitoring code

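import os
import re
import time
from datetime import datetime

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so plots can be saved without a display
import matplotlib.pyplot as plt
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlDeviceGetUtilizationRates,
)

# ===== Configuration =====
MIN_UTIL_THRESHOLD = 1          # minimum utilization (%) to count as active
MIN_MEMORY_THRESHOLD = 100      # minimum memory use in MB (not enabled)
ACTIVE_THRESHOLD = 5            # consecutive active seconds before logging starts
INACTIVE_THRESHOLD = 10         # consecutive idle seconds before logging stops
LOG_DIR = "gpu_logs"            # root directory for all logs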


# ===== Helper functions =====
def get_gpu_status(handle):
    """Return (memory used in MB, total memory in MB, GPU utilization in %)."""
    mem_info = nvmlDeviceGetMemoryInfo(handle)
    util = nvmlDeviceGetUtilizationRates(handle)
    used_mb = mem_info.used / 1024 ** 2   # NVML reports bytes
    total_mb = mem_info.total / 1024 ** 2
    return used_mb, total_mb, util.gpu

def is_gpu_active(util_percent, mem_used_mb):
    # Only utilization is checked. To also count memory use, append:
    #   or mem_used_mb > MIN_MEMORY_THRESHOLD
    return util_percent > MIN_UTIL_THRESHOLD


def parse_log_file(log_path):
    timestamps, mem_usage, utils = [], [], []
    with open(log_path, "r") as f:
        for line in f:
            match = re.match(r"^(.*?) - GPU (\d+): ([\d.]+)/([\d.]+) MB, Utilization: (\d+)%$", line.strip())
            if match:
                timestamp_str = match.group(1)
                mem_used = float(match.group(3))
                util = int(match.group(5))
                timestamps.append(datetime.strptime(timestamp_str, "%Y-%m-%d %H:%M:%S"))
                mem_usage.append(mem_used)
                utils.append(util)
            else:
                print(f"[Warning] Failed to parse line: {line.strip()}")
    return timestamps, mem_usage, utils

def plot_gpu_log(log_path, mem_plot_path, util_plot_path):
    timestamps, mem_usage, utils = parse_log_file(log_path)
    if not timestamps:
        print(f"[Info] No data to plot for {log_path}")
        return

    # Memory usage plot
    plt.figure(figsize=(10, 4))
    plt.plot(timestamps, mem_usage, label="Memory Usage (MB)")
    plt.xlabel("Time")
    plt.ylabel("Memory (MB)")
    plt.title("GPU Memory Usage")
    plt.grid(True)
    plt.tight_layout()
    plt.savefig(mem_plot_path)
    plt.close()
    print(f"[Saved] {mem_plot_path}")

    # Utilization plot
    plt.figure(figsize=(10, 4))
    plt.plot(timestamps, utils, label="GPU Utilization (%)", color="orange")
    plt.xlabel("Time")
    plt.ylabel("Utilization (%)")
    plt.title("GPU Utilization")
    plt.grid(True)
    plt.tight_layout()
    plt.savefig(util_plot_path)
    plt.close()
    print(f"[Saved] {util_plot_path}")


# ===== Per-GPU tracker =====
class GPUTracker:
    def __init__(self, index, handle):
        self.index = index
        self.handle = handle
        self.active_count = 0
        self.inactive_count = 0
        self.monitoring = False
        self.log_file = None
        self.log_path = None
        self.log_dir = None

    def check_status_and_update(self):
        used, total, util = get_gpu_status(self.handle)

        if is_gpu_active(util, used):
            self.active_count += 1
            self.inactive_count = 0
        else:
            self.inactive_count += 1
            self.active_count = 0

        timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
        if self.monitoring:
            log_line = f"{timestamp} - GPU {self.index}: {used:.1f}/{total:.1f} MB, Utilization: {util}%"
            print(log_line)
            self.log_file.write(log_line + "\n")
            self.log_file.flush()

        return util, used

    def start_monitoring(self):
        timestamp = time.strftime("%Y-%m-%d_%H-%M-%S")
        self.log_dir = os.path.join(LOG_DIR, f"gpu_{self.index}", timestamp)
        os.makedirs(self.log_dir, exist_ok=True)
        self.log_path = os.path.join(self.log_dir, "log.txt")
        self.log_file = open(self.log_path, "w")
        self.monitoring = True
        print(f"[Start] GPU {self.index} active. Logging to {self.log_path}")

    def stop_monitoring(self):
        if self.log_file:
            self.log_file.close()
            self.log_file = None  # avoid keeping a closed file handle around
        self.monitoring = False
        print(f"[Stop] GPU {self.index} idle. Finalizing logs...")

        # Generate the plots from the finished log
        mem_plot_path = os.path.join(self.log_dir, "mem_plot.png")
        util_plot_path = os.path.join(self.log_dir, "util_plot.png")
        plot_gpu_log(self.log_path, mem_plot_path, util_plot_path)


# ===== Main =====
def main():
    nvmlInit()
    os.makedirs(LOG_DIR, exist_ok=True)
    device_count = nvmlDeviceGetCount()
    print(f"[Info] Detected {device_count} GPU(s).")

    gpus = [GPUTracker(i, nvmlDeviceGetHandleByIndex(i)) for i in range(device_count)]

    try:
        while True:
            for gpu in gpus:
                gpu.check_status_and_update()

                # Start logging after enough consecutive active samples
                if not gpu.monitoring and gpu.active_count >= ACTIVE_THRESHOLD:
                    gpu.start_monitoring()

                # Stop logging (and plot) after enough consecutive idle samples
                if gpu.monitoring and gpu.inactive_count >= INACTIVE_THRESHOLD:
                    gpu.stop_monitoring()

            time.sleep(1)
    except KeyboardInterrupt:
        print("[Info] Interrupted, shutting down...")
    finally:
        # Close any open logs, write their plots, then release NVML
        for gpu in gpus:
            if gpu.monitoring:
                gpu.stop_monitoring()
        nvmlShutdown()


if __name__ == "__main__":
    main()
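
The log writer and the parser have to agree on the line format, so it can help to check the round trip in isolation. The sketch below reuses the format string and regular expression from the script; the helper names format_line and parse_line and the sample values are assumptions made for this example.

```python
import re
from datetime import datetime

# Same pattern as in parse_log_file, compiled once
LINE_RE = re.compile(
    r"^(.*?) - GPU (\d+): ([\d.]+)/([\d.]+) MB, Utilization: (\d+)%$"
)


def format_line(timestamp, index, used_mb, total_mb, util):
    """Build one log line in the format used by GPUTracker (hypothetical helper)."""
    return f"{timestamp} - GPU {index}: {used_mb:.1f}/{total_mb:.1f} MB, Utilization: {util}%"


def parse_line(line):
    """Parse one log line; return None if it does not match (hypothetical helper)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return (
        datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"),
        int(m.group(2)),    # GPU index
        float(m.group(3)),  # memory used (MB)
        float(m.group(4)),  # total memory (MB)
        int(m.group(5)),    # utilization (%)
    )


# Made-up sample values
line = format_line("2025-07-31 15:42:10", 0, 1234.5, 24576.0, 87)
parsed = parse_line(line)
print(line)
print(parsed)
```

If the log format is ever changed in check_status_and_update, the regular expression in parse_log_file needs the matching change, otherwise every line ends up in the "[Warning] Failed to parse line" branch and no plot is produced.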