Prometheus: Production Practice (Native)
Monitoring targets:
k8s, Linux, Windows, databases, middleware
Exporters:
blackbox probing (blackbox_exporter, covering TCP/HTTP checks), plus the various per-service exporters
Contents
1. Monitoring Targets
Option 1: file_sd_configs
Keep the actual scrape endpoints out of the main config file; list them in files under a targets directory, which Prometheus re-reads on its own.
# prometheus.yml
- job_name: 'mysql'
  file_sd_configs:
    - files:
        - targets/mysql/*.yaml
      refresh_interval: 5m

# targets/mysql/*.yaml (target file)
- targets: ['172.17.12.94:9104','172.17.12.96:9104']
  labels:
    environment: 'dev'
- targets: ['172.17.12.75:9104','172.17.12.80:9104']
  labels:
    environment: 'qa'
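Target files like the one above can also be generated by tooling: file_sd_configs accepts JSON as well as YAML, and Prometheus picks up file changes without a restart. A minimal stdlib sketch (the mysql.json filename is illustrative):

```python
import json

# Each file_sd file is a list of {"targets": [...], "labels": {...}} groups.
# Prometheus re-reads the file on change, so targets can be added or removed
# without touching prometheus.yml or restarting the server.
def write_target_file(path, groups):
    with open(path, "w") as f:
        json.dump(groups, f, indent=2)

groups = [
    {"targets": ["172.17.12.94:9104", "172.17.12.96:9104"],
     "labels": {"environment": "dev"}},
    {"targets": ["172.17.12.75:9104", "172.17.12.80:9104"],
     "labels": {"environment": "qa"}},
]
write_target_file("mysql.json", groups)
```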
Option 2: static_configs
Endpoints are written directly into the main config file (❌ not recommended: tedious, since every change touches the main file).
- job_name: 'mysql_dev'
  static_configs:
    - targets: ['172.17.12.94:9104','172.17.12.96:9104']
      labels:
        environment: 'dev'
- job_name: 'mysql_qa'
  static_configs:
    - targets: ['172.17.12.75:9104','172.17.12.80:9104']
      labels:
        environment: 'qa'
Option 3: kubernetes_sd_configs (Kubernetes monitoring)
- job_name: 'k8s_pro_cadvisor'
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
    - role: node
      api_server: https://192.168.44.245:6443
      # auth for service discovery against the API server
      bearer_token_file: /opt/prometheus/prometheus-2.28.1/token/k8s_pro
      tls_config:
        insecure_skip_verify: true
  # auth for the scrape itself (requests are proxied through the API server)
  bearer_token_file: /opt/prometheus/prometheus-2.28.1/token/k8s_pro
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    - source_labels: [__meta_kubernetes_namespace]
      target_label: environment
      replacement: prod
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.*)
    - action: replace
      regex: (.*)
      source_labels: ["__address__"]
      target_label: __address__
      replacement: 192.168.44.245:6443
    - action: replace
      source_labels: [__meta_kubernetes_node_name]
      target_label: __metrics_path__
      regex: (.*)
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: container_tasks_state|container_fs_(.*)|go_(.*)|container_memory_failures_total|container_threads(.*)|container_memory_cache|container_memory_failcnt|container_memory_mapped_file|container_memory_max_usage_bytes|container_memory_rss|container_memory_swap|container_spec_(.*)
      action: drop
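The replace rules above all follow the same mechanism: the (anchored) regex is matched against the joined source-label values, and $1/${1} references in replacement are expanded into target_label. A rough Python model of one rule (not Prometheus's actual implementation) shows how a node name becomes the cadvisor proxy path:

```python
import re

def relabel_replace(value, regex, replacement):
    # Prometheus anchors relabel regexes as ^regex$
    m = re.fullmatch(regex, value)
    if m is None:
        return None  # no match: the target label is left unchanged
    # translate Prometheus's $1 / ${1} references into re.expand's \1 form
    tmpl = re.sub(r"\$\{?(\d+)\}?", r"\\\1", replacement)
    return m.expand(tmpl)

# the __metrics_path__ rule from the job above, applied to a node called "node01"
path = relabel_replace("node01", "(.*)",
                       "/api/v1/nodes/${1}/proxy/metrics/cadvisor")
# path == "/api/v1/nodes/node01/proxy/metrics/cadvisor"
```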
- job_name: 'k8s_pro_kubelet'
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
    - role: node
      api_server: https://192.168.44.245:6443
      bearer_token_file: /opt/prometheus/prometheus-2.28.1/token/k8s_pro
      tls_config:
        insecure_skip_verify: true
  bearer_token_file: /opt/prometheus/prometheus-2.28.1/token/k8s_pro
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    - source_labels: [__meta_kubernetes_namespace]
      target_label: environment
      replacement: prod
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.*)
    - action: replace
      regex: (.*)
      source_labels: ["__address__"]
      target_label: __address__
      replacement: 192.168.44.245:6443
    - action: replace
      source_labels: [__meta_kubernetes_node_name]
      target_label: __metrics_path__
      regex: (.*)
      replacement: /api/v1/nodes/${1}/proxy/metrics
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: rest_client_request_duration_seconds_bucket|storage_operation_duration_seconds_bucket|go_(.*)
      action: drop
- job_name: 'k8s_pro_exporter'
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
    - role: node
      api_server: https://192.168.44.245:6443
      bearer_token_file: /opt/prometheus/prometheus-2.28.1/token/k8s_pro
      tls_config:
        insecure_skip_verify: true
  bearer_token_file: /opt/prometheus/prometheus-2.28.1/token/k8s_pro
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    - source_labels: [__meta_kubernetes_namespace]
      target_label: environment
      replacement: prod
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.*)
    - action: replace
      source_labels: ["__meta_kubernetes_node_address_InternalIP"]
      target_label: __address__
      replacement: ${1}:9100
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: go_(.*)
      action: drop
- job_name: 'kube-state-metrics-prod'
  kubernetes_sd_configs:
    - role: endpoints
      api_server: https://192.168.44.245:6443
      bearer_token_file: /opt/prometheus/prometheus-2.28.1/token/k8s_pro
      tls_config:
        insecure_skip_verify: true
  bearer_token_file: /opt/prometheus/prometheus-2.28.1/token/k8s_pro
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_pod_name]
      action: keep
      regex: kube-system;kube-state-metrics;.*
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
      action: replace
      target_label: job
      replacement: kube-state-metrics
    - source_labels: [__meta_kubernetes_namespace]
      target_label: environment
      replacement: prod
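The keep rule above works differently from replace: the source label values are joined with ";" and the target is dropped unless the anchored regex matches the joined string. A small Python model of that behavior:

```python
import re

# "keep": join the source label values with the separator (default ";"),
# then keep the target only if the anchored regex matches the whole string.
def keep(values, regex, sep=";"):
    return re.fullmatch(regex, sep.join(values)) is not None

# only endpoints of the kube-state-metrics service in kube-system survive
keep(["kube-system", "kube-state-metrics", "kube-state-metrics-abc12"],
     "kube-system;kube-state-metrics;.*")   # matches -> kept
```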
[root@BETAWS32 prometheus-2.28.1]# cat /opt/prometheus/prometheus-2.28.1/token/k8s_pro
eyJhbxxxxxxU5wQxxxxxUUifQ.exxxxxxxxxxxx
2. Alerting Rules
rule_files:
  - "rules/*.yaml"
groups:
  - name: k8s_pods
    rules:
      - alert: ContainerHighCPUUsage
        # namespace must appear in the "by" clauses, otherwise
        # {{ $labels.namespace }} in the annotations would be empty
        expr: sum by (namespace,pod,instance,environment) (rate(container_cpu_usage_seconds_total{container!="istio-proxy",container!="POD",namespace=~"beta.*",image!=""}[3m]))/(sum by (namespace,pod,instance,environment) (container_spec_cpu_quota{container!="istio-proxy",container!="POD",image!="",namespace=~"beta.*"}) / 100000) > 0.90
        for: 0m
        labels:
          severity: warning
          team: operations
        annotations:
          summary: "Container {{ $labels.namespace }}/{{ $labels.pod }} CPU usage above 90%"
          description: "Container {{ $labels.namespace }}/{{ $labels.pod }} CPU usage above 90%, current value {{ $value }}"
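The division in the expr works because container_spec_cpu_quota is the CFS quota in microseconds per 100 ms scheduling period, so quota/100000 is the CPU limit in cores, while the rate() term is usage in cores. A sketch of just that arithmetic:

```python
# container_spec_cpu_quota is the CFS quota in microseconds per 100ms
# period, so quota / 100_000 yields the CPU limit in cores.
def cpu_utilization(usage_cores, quota_us):
    limit_cores = quota_us / 100_000   # e.g. 200000 -> 2.0 cores
    return usage_cores / limit_cores

# a pod limited to 2 cores burning 1.9 cores sits at 95%,
# above the rule's 0.90 threshold
ratio = cpu_utilization(1.9, 200_000)
```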
Alert configuration
Coverage
alertmanager:
1. Linux hosts: cpu, mem, net, volume, io, process
2. Windows hosts: cpu, mem, net, volume, io, process, services, iis
3. consul: service check, master check, agent check
4. pods: cpu, mem, io, hpa, deployment, statefulset, daemonset, health, crash, kube certificate
5. envoy: cluster upstream
6. istio
7. coredns
8. blackbox: http check, slow-response check, ssl certificate
9. databases: mysql, redis, mongodb; sqlserver not yet covered
Hot reload (requires Prometheus to be started with --web.enable-lifecycle):
curl -X PUT localhost:9090/-/reload
Silencing (managed in Alertmanager, via its web UI or amtool)
Resources
- prometheus.yml
- weixin.tmpl
- alertmanager.yml