Building a Prometheus + Grafana Monitoring Stack: From Getting Started to Alert Configuration

Originally published 2025-08-21 11:21:34

License: CC 4.0 BY-SA

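Preface

Monitoring is the "eyes" of operations, and the Prometheus + Grafana combination has become the de facto standard for cloud-native monitoring. This article walks you through building a complete monitoring stack, step by step:

Data collection: Node Exporter exposes host metrics
Storage and analysis: Prometheus scrapes and stores the metrics
Visualization: Grafana presents them in dashboards
Alerting: notifications through WeCom (WeChat Work) or email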

I. Environment Preparation

Servers: CentOS 7.x / Ubuntu 22.04 (1 CPU core and 2 GB of RAM are enough)
Component versions:
- Prometheus 2.45+
- Grafana 10.2+
- Node Exporter 1.6+

II. Installation and Configuration

1. Deploy Node Exporter (on every monitored node)

# Download and extract
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar -xvf node_exporter-*.tar.gz
cd node_exporter-*/

# Start the service (default port 9100)
nohup ./node_exporter &

# Verify that metrics are exposed (Node Exporter reports CPU time as node_cpu_seconds_total)
curl -s http://localhost:9100/metrics | grep "node_cpu_seconds_total"

📌 To monitor Docker containers as well, additionally install cAdvisor.
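A process started with nohup does not come back after a reboot. As a rough sketch of one alternative, assuming a systemd-based host, the helper below writes a minimal unit file. The function name write_node_exporter_unit, the unit file location, and the binary path /usr/local/bin/node_exporter are illustrative assumptions, not part of the original setup.

```shell
# Write a minimal systemd unit for Node Exporter.
# $1 = path of the unit file to create (assumed location)
# $2 = absolute path of the node_exporter binary (assumed location)
write_node_exporter_unit() {
  unit_path="$1"
  binary_path="$2"
  cat > "$unit_path" <<EOF
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=$binary_path
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Typical usage, run as root (paths are examples only):
#   cp node_exporter /usr/local/bin/
#   write_node_exporter_unit /etc/systemd/system/node_exporter.service /usr/local/bin/node_exporter
#   systemctl daemon-reload
#   systemctl enable --now node_exporter
```

For anything beyond a test machine it is probably worth adding a dedicated unprivileged user through a User= line in the [Service] section; that is left out here because the user would have to be created first.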

2. Install Prometheus (on the monitoring server)

# Download and extract Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar -xvf prometheus-*.tar.gz
cd prometheus-*/

# Write the configuration file prometheus.yml
cat <<EOF > prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['192.168.1.100:9100', '192.168.1.101:9100']  # replace with your actual IPs

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
EOF

# Start the service (default port 9090)
nohup ./prometheus --config.file=prometheus.yml &

Key settings explained:

scrape_interval: how often Prometheus scrapes metrics from each target
targets: a static list of host:port endpoints; dynamic environments can use service discovery instead (for example kubernetes_sd_configs or dns_sd_configs)
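Before relying on this configuration, it can help to validate the file and confirm that every target is actually being scraped. promtool ships in the same tarball as Prometheus, and the /api/v1/targets endpoint reports the health of each target. The helper below is only a sketch: summarize_target_health is a hypothetical name, and it uses grep instead of a real JSON parser, so it assumes the compact JSON that Prometheus returns.

```shell
# Count scrape targets by health state.
# Reads the JSON body of Prometheus's /api/v1/targets endpoint on stdin and
# prints one "<count> <state>" line per state found (up, down, unknown).
summarize_target_health() {
  grep -o '"health":"[a-z]*"' \
    | cut -d'"' -f4 \
    | sort \
    | uniq -c \
    | awk '{ print $1, $2 }'
}

# Typical usage on the monitoring server:
#   ./promtool check config prometheus.yml
#   curl -s http://localhost:9090/api/v1/targets | summarize_target_health
```

If jq is installed, parsing the response with it would be more robust than the grep pipeline shown here.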

3. Deploy Grafana (visualization)

# Ubuntu/Debian (apt-key is deprecated on newer releases but still works on Ubuntu 22.04)
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update && sudo apt install -y grafana

# CentOS/RHEL
sudo yum install -y https://dl.grafana.com/oss/release/grafana-10.2.0-1.x86_64.rpm

# Start the service
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Open http://<server IP>:3000 and log in with the default account admin/admin. Grafana asks you to set a new password on first login.
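Grafana can take a few seconds to start listening after systemctl start returns. A small retry helper, sketched below, avoids guessing how long to wait; wait_for is a hypothetical name, and the usage line assumes Grafana's /api/health endpoint on the default port 3000.

```shell
# Retry a command until it succeeds or the attempts run out.
# $1 = maximum number of attempts
# $2 = seconds to sleep between attempts
# remaining arguments = the command to run
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Discard the command's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Typical usage:
#   wait_for 30 2 curl -fsS http://localhost:3000/api/health && echo "Grafana is up"
```

The same helper works for Prometheus (port 9090) or Node Exporter (port 9100) by swapping the URL.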

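III. Configuring the Data Source and Dashboards

1. Add the Prometheus data source

Log in to Grafana → Connections → Data sources → Add data source (older Grafana versions list this under Configuration → Data Sources)
Select Prometheus and enter the URL: http://localhost:9090 (this assumes Grafana runs on the same host as Prometheus; otherwise use the Prometheus server's address)
Click Save & test

2. Import a dashboard from grafana.com

Go to Dashboards → Import
Enter dashboard ID 1860 (Node Exporter Full)
Select the Prometheus data source → Import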
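The same data source can also be declared in a file instead of through the UI, using Grafana's provisioning mechanism. The sketch below assumes a package install, where provisioning files are read from /etc/grafana/provisioning/datasources; the function name write_prometheus_datasource and the file name prometheus.yaml are illustrative choices.

```shell
# Write a Grafana data source provisioning file for Prometheus.
# $1 = path of the YAML file to create
# $2 = URL of the Prometheus server as seen from Grafana
write_prometheus_datasource() {
  out_file="$1"
  prom_url="$2"
  cat > "$out_file" <<EOF
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $prom_url
    isDefault: true
EOF
}

# Typical usage, run as root (path is the package-install default):
#   write_prometheus_datasource /etc/grafana/provisioning/datasources/prometheus.yaml http://localhost:9090
#   systemctl restart grafana-server
```

Keeping the data source in a file makes the setup repeatable across servers, at the cost of needing a Grafana restart to pick up changes.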

(Figure: dashboard showing key metrics such as CPU, memory, disk, and network)

IV. Alert Configuration (WeCom Example)

1. Configure Alertmanager

# Download and extract Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar -xvf alertmanager-*.tar.gz
cd alertmanager-*/

# Write the configuration file alertmanager.yml
cat <<EOF > alertmanager.yml
route:
  receiver: 'wechat'
receivers:
- name: 'wechat'
  wechat_configs:
  - corp_id: 'your WeCom CorpID'
    to_user: '@all'
    agent_id: 'your application AgentID'
    api_secret: 'your application Secret'
EOF

# Start the service (default port 9093)
nohup ./alertmanager --config.file=alertmanager.yml &
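To check the notification path without waiting for a real alert, one option is to validate the file with amtool (shipped in the Alertmanager tarball) and then post a hand-made alert to Alertmanager's v2 API. The helper below is a sketch under those assumptions; build_test_alert is a hypothetical name, and the arguments must not contain double quotes because no JSON escaping is done.

```shell
# Build a JSON payload for Alertmanager's POST /api/v2/alerts endpoint.
# $1 = alert name, $2 = severity label, $3 = summary text (no double quotes)
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"%s"}}]\n' \
    "$1" "$2" "$3"
}

# Typical usage from the Alertmanager directory:
#   ./amtool check-config alertmanager.yml
#   build_test_alert TestAlert warning "manual test" \
#     | curl -s -X POST -H 'Content-Type: application/json' --data @- \
#         http://localhost:9093/api/v2/alerts
```

If the WeCom settings are correct, the test alert should arrive in the application's chat after the route's group_wait delay.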

2. Connect Prometheus to Alertmanager

Add the following to prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'alert_rules.yml'  # alerting rules file

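3. Define alerting rules

Create alert_rules.yml, for example:

groups:
- name: host_stats
  rules:
  - alert: HighCPUUsage
    # rate() is used here rather than irate() because it smooths short spikes, which suits alerting
    expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage ({{ $value }}%)"
      description: "CPU usage on instance {{ $labels.instance }} has exceeded 80%"

After saving the file, restart Prometheus (or send it SIGHUP) so that it loads the new alerting configuration and rules.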
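The rule expression turns the idle fraction of CPU time into a usage percentage: an average idle rate of 0.15 means 100 - 0.15 * 100 = 85% usage, which is above the 80% threshold. The alert only fires once that has stayed true for the full 5 minutes set by for. The sketch below reproduces the arithmetic outside Prometheus; cpu_usage_pct and exceeds_threshold are hypothetical helper names, and the rule file itself can be checked with ./promtool check rules alert_rules.yml.

```shell
# Convert an average idle rate into a CPU usage percentage.
# $1 = idle rate as a fraction between 0 and 1
cpu_usage_pct() {
  awk -v idle="$1" 'BEGIN { printf "%.1f\n", 100 - idle * 100 }'
}

# Succeed (exit status 0) when the usage derived from the idle rate
# is strictly greater than the threshold, mirroring the "> 80" in the rule.
# $1 = idle rate as a fraction, $2 = threshold in percent
exceeds_threshold() {
  awk -v idle="$1" -v limit="$2" 'BEGIN { exit !((100 - idle * 100) > limit) }'
}

# Example:
#   cpu_usage_pct 0.15            # prints 85.0
#   exceeds_threshold 0.15 80     # succeeds: 85 > 80
#   exceeds_threshold 0.25 80     # fails: 75 is not > 80
```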

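V. Troubleshooting Common Problems

1. Prometheus cannot scrape Node Exporter

Check the firewall rules: iptables -L -n | grep 9100
Verify that the endpoint is reachable: curl http://<target_ip>:9100/metrics

2. Grafana panels show "No Data"

Confirm that the time range (top right) is set correctly
Check the PromQL syntax, for example: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

3. WeCom alerts are not delivered

Check API permissions: the WeCom application must have the "send messages" permission enabled
Check the Alertmanager logs: tail -f nohup.out in the Alertmanager directory when it was started with nohup as above, or journalctl -u alertmanager -f if it runs as a systemd service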

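VI. Suggestions for Going Further

Long-term storage: integrate VictoriaMetrics or Thanos to work around the limits of Prometheus's local storage
Service discovery: in Kubernetes environments, use kubernetes_sd_configs to discover Pods automatically
Custom metrics: expose business metrics through a client library (for example the promhttp package of Go's client_golang)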
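Client libraries are the usual route for application metrics. For values produced by scripts or cron jobs, a lighter alternative (a different technique from the client library approach named above) is Node Exporter's textfile collector, which reads *.prom files from a directory given by --collector.textfile.directory. The sketch below writes one gauge in the text exposition format; write_textfile_metric, the directory, and the metric name in the usage lines are hypothetical.

```shell
# Write a single gauge for Node Exporter's textfile collector.
# $1 = collector directory, $2 = metric name, $3 = value, $4 = help text
write_textfile_metric() {
  dir="$1"
  name="$2"
  value="$3"
  help_text="$4"
  # The temporary name does not end in .prom, so the collector ignores it.
  tmp_file="$dir/$name.prom.$$"
  {
    printf '# HELP %s %s\n' "$name" "$help_text"
    printf '# TYPE %s gauge\n' "$name"
    printf '%s %s\n' "$name" "$value"
  } > "$tmp_file"
  # Rename into place so the collector never reads a half-written file.
  mv "$tmp_file" "$dir/$name.prom"
}

# Typical usage (directory and metric name are examples only):
#   ./node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile &
#   write_textfile_metric /var/lib/node_exporter/textfile \
#     backup_last_success_timestamp_seconds "$(date +%s)" \
#     "Unix time of the last successful backup"
```

The metric then appears on the existing :9100/metrics endpoint, so no extra scrape job is needed in prometheus.yml.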

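Summary

By following this article, you have set up:
✅ Host metrics collection
✅ Metrics storage in Prometheus
✅ Grafana dashboards
✅ WeCom alert integration

Next step: try monitoring middleware such as MySQL or Redis (see dashboard IDs 7362 and 11835).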

Appendix: PromQL Quick Reference

# CPU usage (%)
100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Available memory (%)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

# Disk usage of the root filesystem (%)
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)
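These expressions can also be run from the command line through Prometheus's HTTP API at /api/v1/query, which is handy in scripts. The helper below is a sketch that pulls the sample values out of an instant-query response; extract_values is a hypothetical name, and the grep/sed approach assumes the compact JSON Prometheus returns (jq would be sturdier if available).

```shell
# Extract sample values from an instant-query response.
# Reads the JSON body of /api/v1/query on stdin and prints one value per
# series, in the order the series appear in the response.
extract_values() {
  grep -o '"value":\[[^]]*]' \
    | sed -E 's/.*,"([^"]*)"]$/\1/'
}

# Typical usage (curl URL-encodes the expression via --data-urlencode):
#   curl -s -G http://localhost:9090/api/v1/query \
#     --data-urlencode 'query=node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100' \
#     | extract_values
```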

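Tags: #Prometheus #Grafana #monitoring #ops-in-practice #alerting

Discussion: what pitfalls have you run into when setting up a monitoring system? Share them in the comments. If you would like more detail on any part (such as exposing custom metrics), leave a comment and let me know.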