Alerting Best Practices #

I. Alert Design Principles #

1.1 Core Principles #

text

Core principles of alert design:

┌─────────────────────────────────────────────┐
│ 1. Actionability                            │
├─────────────────────────────────────────────┤
│ • Every alert needs someone to act on it    │
│ • Alert text includes remediation advice    │
│ • Avoid alerts nobody can act on            │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ 2. Clarity                                  │
├─────────────────────────────────────────────┤
│ • Alert messages are clear and specific     │
│ • Include enough context                    │
│ • Use template variables for details        │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ 3. Timeliness                               │
├─────────────────────────────────────────────┤
│ • Notify quickly on critical alerts         │
│ • Set a sensible `for` duration             │
│ • Avoid delayed handling                    │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ 4. Accuracy                                 │
├─────────────────────────────────────────────┤
│ • Avoid false positives                     │
│ • Set sensible thresholds                   │
│ • Use `for` to filter out brief spikes      │
└─────────────────────────────────────────────┘

1.2 Alert Severity Levels #

text

Alert severity levels:

┌─────────────────────────────────────────────┐
│ Critical                                    │
├─────────────────────────────────────────────┤
│ • Requires immediate action                 │
│ • Affects core business                     │
│ • Examples: service down, data loss         │
│ • Notification: phone call, SMS             │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ Warning                                     │
├─────────────────────────────────────────────┤
│ • Needs attention                           │
│ • May affect business                       │
│ • Example: high resource usage              │
│ • Notification: email, IM                   │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ Info                                        │
├─────────────────────────────────────────────┤
│ • For reference only                        │
│ • No immediate action needed                │
│ • Example: configuration change             │
│ • Notification: logs                        │
└─────────────────────────────────────────────┘

II. Alert Rule Design #

2.1 Threshold Settings #

yaml

# Example of well-chosen thresholds

groups:
  - name: resource_alerts
    rules:
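      # CPU usage: multi-level thresholds
      # rate() is used rather than irate(): irate() only looks at the last two
      # samples, so a brief dip can reset the `for` timer.
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80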
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.2f\" }}%"

      - alert: CriticalCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.2f\" }}%"

      # Memory usage: multi-level thresholds
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.2f\" }}%"

      - alert: CriticalMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.2f\" }}%"

      # Disk usage: multi-level thresholds
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has only {{ $value | printf \"%.2f\" }}% free"

      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has only {{ $value | printf \"%.2f\" }}% free"

2.2 Duration Settings #

yaml

# Suggested `for` durations

groups:
  - name: duration_examples
    rules:
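      # Short duration: critical alerts
      - alert: InstanceDown
        expr: up == 0
        for: 1m           # 1 minute
        labels:
          severity: critical

      # Medium duration: warning alerts
      - alert: HighCPUUsage
        # Aggregate per instance so one busy host is not averaged away across the fleet
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m           # 5 minutes
        labels:
          severity: warning

      # Long duration: informational alerts
      - alert: LowThroughput
        expr: sum(rate(http_requests_total[5m])) < 10
        for: 30m          # 30 minutes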
        labels:
          severity: info

2.3 Alert Content Design #

yaml

# Example of well-written alert content

groups:
  - name: good_examples
    rules:
      - alert: HighErrorRate
        # Aggregate per service so the `service` label is available in the annotations.
        # (`.Alerts` exists only in Alertmanager notification templates, not in rule annotations.)
        expr: sum by(service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(service) (rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate for {{ $labels.service }} on {{ $externalLabels.cluster }}"
          description: |
            Error rate is {{ $value | printf "%.2f" }}% which exceeds the 5% threshold.

            Impact: Users may experience errors when accessing the service.

            Affected service: {{ $labels.service }}
            
            Suggested actions:
            1. Check application logs for errors
            2. Review recent deployments
            3. Check database connectivity
            4. Scale up if needed
            
            Dashboard: https://grafana.example.com/d/error-dashboard
            Runbook: https://wiki.example.com/runbooks/high-error-rate

III. Alert Noise Reduction #

3.1 Grouping Strategy #

yaml

# Alertmanager grouping configuration

route:
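  # Group by alert name and severity
  group_by: ['alertname', 'severity']

  # Wait 30 seconds to collect alerts for a new group before the first notification
  group_wait: 30s

  # Wait 5 minutes before notifying about new alerts added to an existing group
  group_interval: 5m

  # Re-send the notification every hour while the alerts keep firing
  repeat_interval: 1h

  routes:
    # Critical alerts: send quickly
    - match:
        severity: critical
      group_wait: 10s
      group_interval: 1m
      repeat_interval: 5m

    # Warning alerts: normal cadence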
    - match:
        severity: warning
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 30m
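
The `match` field used above is listed as deprecated in the Alertmanager configuration reference; recent releases prefer a `matchers` list. The following is a sketch of the same two child routes in that form. It assumes a reasonably recent Alertmanager, and the root receiver name `default-receiver` is a placeholder that would still need to be defined under `receivers`.

```yaml
route:
  receiver: 'default-receiver'   # placeholder name, define it under `receivers`
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h

  routes:
    # Critical alerts: send quickly
    - matchers:
        - 'severity = "critical"'
      group_wait: 10s
      group_interval: 1m
      repeat_interval: 5m

    # Warning alerts: normal cadence
    - matchers:
        - 'severity = "warning"'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 30m
```

Each entry in `matchers` is a string holding a label name, an operator (`=`, `!=`, `=~`, `!~`) and a value, so regular-expression matches no longer need a separate field.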

3.2 Inhibition Rules #

yaml

# Example inhibition rules

inhibit_rules:
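  # A node that is down inhibits all other alerts for that node
  - source_match:
      alertname: 'InstanceDown'
    target_match_re:
      alertname: '.*'
    equal: ['instance']

  # An unavailable service inhibits all other alerts for that service
  - source_match:
      alertname: 'ServiceDown'
    target_match_re:
      alertname: '.*'
    equal: ['service']

  # A critical alert inhibits the warning alert with the same alertname and instance
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

  # A database that is down inhibits database-related alerts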
  - source_match:
      alertname: 'DatabaseDown'
    target_match_re:
      alertname: '(HighDBConnections|SlowDBQueries).*'
    equal: ['database']
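
Note that the "critical inhibits warning" rule above lists `alertname` under `equal`, so it only applies when the critical and the warning alert carry the same alert name. Rule pairs with different names (for example `HighCPUUsage` and `CriticalCPUUsage`) would not be matched by it. One possible way to line the two up, sketched here as an assumption rather than the only option, is to give both severity levels the same alert name and distinguish them only by the `severity` label. The group name is hypothetical.

```yaml
groups:
  - name: cpu_alerts_shared_name   # hypothetical group name
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning

      # Same alert name, different severity label
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 2m
        labels:
          severity: critical
```

With this naming, a firing critical `HighCPUUsage` for an instance suppresses the warning notification for the same instance. The alternative is to drop `alertname` from `equal`, at the cost of any critical alert on an instance suppressing every warning on it.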

3.3 Silencing Strategy #

text

When to use silences:

┌─────────────────────────────────────────────┐
│ 1. Planned maintenance                      │
├─────────────────────────────────────────────┤
│ • System upgrades                           │
│ • Hardware maintenance                      │
│ • Network maintenance                       │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ 2. Known issues                             │
├─────────────────────────────────────────────┤
│ • Issues being worked on                    │
│ • Issues awaiting a fix                     │
│ • Low-priority issues                       │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ 3. Non-production environments              │
├─────────────────────────────────────────────┤
│ • Development                               │
│ • Testing                                   │
│ • Staging (pre-release)                     │
└─────────────────────────────────────────────┘
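
Silences themselves are created at runtime (through the Alertmanager UI, its API, or `amtool`), not in the configuration file. For recurring maintenance windows, a related but different mechanism is to mute a route on a schedule. The following is only a sketch: it assumes an Alertmanager version that supports the top-level `time_intervals` section, and the interval name, schedule, and receiver names are hypothetical.

```yaml
# Named time intervals that routes can refer to
time_intervals:
  - name: weekly-maintenance        # hypothetical name
    time_intervals:
      - weekdays: ['saturday']
        times:
          - start_time: '02:00'     # interpreted as UTC unless a location is set
            end_time: '04:00'

route:
  receiver: 'default-receiver'      # placeholder; the root route itself cannot be muted
  routes:
    - matchers:
        - 'severity = "warning"'
      receiver: 'warning-receiver'  # placeholder
      # Suppress notifications from this route during the maintenance window
      mute_time_intervals:
        - weekly-maintenance
```

Muting only suppresses notifications for the matching route during the window. One-off maintenance and known issues are usually a better fit for an ordinary silence with an expiry time.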

IV. Alert Routing #

4.1 Routing by Team #

yaml

route:
  receiver: 'default-receiver'
  routes:
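    # Backend team
    - match:
        team: backend
      receiver: 'backend-team'

    # Frontend team
    - match:
        team: frontend
      receiver: 'frontend-team'

    # Infrastructure team
    - match:
        team: infrastructure
      receiver: 'infrastructure-team'

    # Database team
    - match:
        team: database
      receiver: 'database-team'

receivers:
  # Fallback for alerts that match no child route; add integrations as needed
  - name: 'default-receiver'

  - name: 'backend-team'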
    slack_configs:
      - channel: '#backend-alerts'
  
  - name: 'frontend-team'
    slack_configs:
      - channel: '#frontend-alerts'
  
  - name: 'infrastructure-team'
    slack_configs:
      - channel: '#infra-alerts'
  
  - name: 'database-team'
    slack_configs:
      - channel: '#db-alerts'

4.2 Routing by Severity #

yaml

route:
  receiver: 'default-receiver'
  routes:
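    # Critical alerts: multi-channel notification
    - match:
        severity: critical
      receiver: 'critical-receiver'
      continue: true  # keep evaluating the sibling routes that follow

    # Warning alerts
    - match:
        severity: warning
      receiver: 'warning-receiver'

receivers:
  # Fallback for alerts that match no child route; add integrations as needed
  - name: 'default-receiver'

  - name: 'critical-receiver'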
    email_configs:
      - to: 'oncall@example.com'
    slack_configs:
      - channel: '#critical-alerts'
    pagerduty_configs:
      - service_key: 'xxx'
  
  - name: 'warning-receiver'
    slack_configs:
      - channel: '#warnings'

V. Alert Templates #

5.1 Custom Templates #

yaml

# alertmanager.yml

global:
  resolve_timeout: 5m

templates:
  - '/etc/alertmanager/templates/*.tmpl'

receivers:
  - name: 'slack-receiver'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'

text

# /etc/alertmanager/templates/slack.tmpl

{{ define "slack.title" }}
[{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}
{{ end }}

{{ define "slack.text" }}
{{ if eq .Status "firing" }}
:fire: *Alert Firing*
{{ else }}
:white_check_mark: *Alert Resolved*
{{ end }}

*Details:*
{{ range .Alerts }}
- Alert: {{ .Labels.alertname }}
  Instance: {{ .Labels.instance }}
  Severity: {{ .Labels.severity }}
  Description: {{ .Annotations.description }}
{{ end }}

*Alertmanager:* {{ .ExternalURL }}
{{ end }}

VI. Testing Alerts #

6.1 Triggering an Alert Manually #

bash

# Send a test alert with amtool
amtool alert add \
    alertname=TestAlert \
    severity=warning \
    instance=test-instance \
    --alertmanager.url=http://localhost:9093

# List alerts
amtool alert query --alertmanager.url=http://localhost:9093

# List silences
amtool silence query --alertmanager.url=http://localhost:9093

6.2 Validating Alert Rules #

bash

# Check alert rule syntax
promtool check rules alerting_rules.yml

# Show the state of active alerts in Prometheus
curl http://localhost:9090/api/v1/alerts

# Show the loaded alert rules
curl http://localhost:9090/api/v1/rules
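
A syntax check does not tell you whether a rule fires when expected. promtool can also run unit tests against a rule file. The following is a minimal sketch, assuming that `alerting_rules.yml` contains the `InstanceDown` rule from section 2.2 (`up == 0`, `for: 1m`, `severity: critical`, no annotations). The test file name and the `job` and `instance` label values are made up for illustration.

```yaml
# alerting_rules_test.yml (hypothetical file name)
rule_files:
  - alerting_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: up for 2 samples, then down
      - series: 'up{job="node", instance="host1:9100"}'
        values: '1 1 0 0 0 0'

    alert_rule_test:
      # At 2m the target has only just gone down, so the alert is
      # still pending and nothing is expected to fire.
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts: []

      # At 4m the condition has held for longer than `for: 1m`.
      - eval_time: 4m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: host1:9100
```

The test would be run with `promtool test rules alerting_rules_test.yml`. If the rule under test defines annotations, the expected alert also needs a matching `exp_annotations` block, otherwise the comparison fails.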

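VII. Summary #

Alert design principles:

Principle	Description
Actionability	Every alert needs someone to act on it
Clarity	Messages are clear and specific
Timeliness	Notify quickly
Accuracy	Avoid false positives

Alert noise-reduction strategies:

Strategy	Description
Grouping	Merge similar alerts
Inhibition	Higher-priority alerts suppress lower-priority ones
Silencing	Temporarily mute alerts

Next, let's learn about visualization!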