Alerting Rules

1. Overview of Alerting Rules

1.1 What Is an Alerting Rule

text

Definition of an alerting rule:

┌─────────────────────────────────────────────┐
│ Alerting rule                               │
├─────────────────────────────────────────────┤
│ • Condition is a PromQL expression          │
│ • Configured in Prometheus                  │
│ • Sent to Alertmanager once firing          │
│ • Supports labels, annotations, and for     │
└─────────────────────────────────────────────┘

Alert flow:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Prometheus  │ ──> │Alertmanager │ ──> │Notification │
│ alert rules │     │ processing  │     │ Email/Slack │
└─────────────┘     └─────────────┘     └─────────────┘
      │                   │                   │
      │ evaluate rules    │ dedupe/group      │ send notifications
      ▼                   ▼                   ▼

1.2 Rule File Structure

yaml

# alerting_rules.yml

groups:
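  - name: example_group    # rule group name
    interval: 30s          # evaluation interval (optional; defaults to global.evaluation_interval)
    rules:                 # list of rules
      - alert: AlertName   # alert name
        expr: up == 0      # PromQL expression
        for: 1m            # how long the condition must hold before firing
        labels:            # labels attached to the alert
          severity: critical
        annotations:       # annotations (descriptive text)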
          summary: "Instance down"
          description: "Instance {{ $labels.instance }} is down"

2. Alerting Rule Syntax

2.1 Basic Syntax

yaml

groups:
  - name: group_name
    interval: 30s          # optional, evaluation interval
    rules:
      - alert: AlertName
        expr: <PromQL expression>
        for: <duration>
        labels:
          <label name>: <label value>
        annotations:
          <annotation name>: <annotation value>

2.2 Field Reference

text

Alerting rule fields:

┌─────────────────────────────────────────────┐
│ alert                                       │
├─────────────────────────────────────────────┤
│ • Alert name                                │
│ • Valid label value; need not be unique     │
│ • Use a descriptive name                    │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ expr                                        │
├─────────────────────────────────────────────┤
│ • PromQL expression                         │
│ • Each returned series becomes an alert     │
│ • Supports the full PromQL syntax           │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ for                                         │
├─────────────────────────────────────────────┤
│ • Hold duration                             │
│ • Fires only if expr stays true this long   │
│ • Filters out transient spikes              │
│ • Optional; default 0s fires immediately    │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ labels                                      │
├─────────────────────────────────────────────┤
│ • Extra labels attached to the alert        │
│ • Used for routing and grouping             │
│ • Optional                                  │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ annotations                                 │
├─────────────────────────────────────────────┤
│ • Informational text                        │
│ • Used to describe the alert                │
│ • Supports template variables               │
│ • Optional                                  │
└─────────────────────────────────────────────┘
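
To make the fields concrete, here is a minimal sketch that contrasts a rule using only the required fields with one that uses every field. The group name, alert names, and the `team` label are hypothetical and chosen only for illustration.

```yaml
groups:
  - name: field_examples              # hypothetical group name
    rules:
      # Minimal rule: only alert and expr are required. Without `for`,
      # the alert fires on the first evaluation where expr returns a result.
      - alert: TargetMissing          # hypothetical alert name
        expr: up == 0

      # Full rule: expr returns one series per scrape target, so each
      # affected target produces its own alert with its own labels.
      - alert: TargetMissingSustained # hypothetical alert name
        expr: up == 0
        for: 10m
        labels:
          severity: warning
          team: infra                 # hypothetical label, e.g. for routing
        annotations:
          summary: "{{ $labels.instance }} unreachable"
          description: "Job {{ $labels.job }} has been failing scrapes for 10 minutes."
```

The second rule trades detection speed for fewer false positives: a target that misses a single scrape never leaves the pending state.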

3. Common Alerting Rules

3.1 Host Monitoring Alerts

yaml

groups:
  - name: node_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.2f\" }}%"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.2f\" }}%"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has only {{ $value | printf \"%.2f\" }}% free space"

      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has only {{ $value | printf \"%.2f\" }}% free space"

3.2 Application Monitoring Alerts

yaml

groups:
  - name: application_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | printf \"%.2f\" }}%"

      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P99 latency is {{ $value | printf \"%.2f\" }} seconds"

      - alert: LowThroughput
        expr: sum(rate(http_requests_total[5m])) < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low throughput detected"
          description: "Request rate is {{ $value | printf \"%.2f\" }} requests/second"

      - alert: TooManyConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Too many MySQL connections"
          description: "Connection usage is {{ $value | printf \"%.2f\" }}%"

3.3 Service Availability Alerts

yaml

groups:
  - name: availability_alerts
    rules:
      - alert: ServiceAvailabilityLow
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) 
          / 
          sum(rate(http_requests_total[5m])) * 100 < 99
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Service availability is low"
          description: "Availability is {{ $value | printf \"%.2f\" }}%"

      - alert: ErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (1 - 0.999) * 4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget is burning too fast"
          description: "Error rate is {{ $value | printf \"%.4f\" }}"
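
A 1h window keeps reporting a high error ratio for a long time after the errors have stopped. One common refinement, sketched here as an assumption rather than as part of the rules above, is to require the same threshold over a short window as well, so the alert clears soon after recovery. The group and alert names are hypothetical; the 99.9% target and the factor 4 are reused from ErrorBudgetBurn.

```yaml
groups:
  - name: availability_alerts_multiwindow    # hypothetical group name
    rules:
      - alert: ErrorBudgetBurnMultiWindow     # hypothetical alert name
        # Both comparisons must hold. `and` binds more loosely than `>`,
        # so this reads as (1h ratio > threshold) and (5m ratio > threshold).
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (1 - 0.999) * 4
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (1 - 0.999) * 4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget is burning too fast"
          # $value comes from the left-hand (1h) side of `and`.
          description: "1h error ratio is {{ $value | printf \"%.4f\" }}"
```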

4. Template Variables

4.1 Common Template Variables

text

Template variables:

┌─────────────────────────────────────────────┐
│ $labels                                     │
├─────────────────────────────────────────────┤
│ • Labels of the alert                       │
│ • Example: {{ $labels.instance }}           │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ $value                                      │
├─────────────────────────────────────────────┤
│ • Value of the alert expression             │
│ • Example: {{ $value }}                     │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ $externalLabels                             │
├─────────────────────────────────────────────┤
│ • External labels of Prometheus             │
│ • Example: {{ $externalLabels.cluster }}    │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ $externalURL                                │
├─────────────────────────────────────────────┤
│ • External URL of Prometheus                │
│ • Example: {{ $externalURL }}               │
└─────────────────────────────────────────────┘

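4.2 Template Functions

yaml

# Number formatting
{{ $value | printf "%.2f" }}

# Human-readable bytes (humanize1024 emits 1024-based prefixes such as Ki, Mi, Gi; append the unit yourself)
{{ $value | humanize1024 }}B

# Human-readable duration (input in seconds)
{{ $value | humanizeDuration }}

# Human-readable percentage (input is a ratio, e.g. 0.25 renders as 25%)
{{ $value | humanizePercentage }}

# Unix timestamp in seconds to a readable time
{{ $value | humanizeTimestamp }}

# Title case
{{ $labels.instance | title }}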
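
Which function fits depends on the unit of the value the expression returns. The sketch below pairs two of them with expressions that produce a matching unit; the group name, alert names, and thresholds are hypothetical.

```yaml
groups:
  - name: template_function_examples     # hypothetical group name
    rules:
      - alert: FilesystemAlmostFull      # hypothetical alert name
        # The expression returns available bytes, so $value is a byte count
        # and humanize1024 renders it with a Ki/Mi/Gi prefix.
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} < 5 * 1024 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Only {{ $value | humanize1024 }}B left on {{ $labels.mountpoint }}"

      - alert: HighErrorRatio            # hypothetical alert name
        # The expression returns a ratio between 0 and 1, which is the
        # input humanizePercentage expects (no "* 100" in the expression).
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error ratio is {{ $value | humanizePercentage }}"
```

If an expression already multiplies by 100, as the rules in section 3 do, `printf "%.2f"` followed by a literal `%` is the better fit, since `humanizePercentage` would scale the value a second time.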

4.3 Template Example

yaml

groups:
  - name: template_examples
    rules:
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: |
            Disk {{ $labels.mountpoint }} on {{ $labels.instance }} has only {{ $value | printf "%.2f" }}% free space.
            Total: {{ printf "node_filesystem_size_bytes{instance='%s',mountpoint='%s'}" $labels.instance $labels.mountpoint | query | first | value | humanize1024 }}B
            Available: {{ printf "node_filesystem_avail_bytes{instance='%s',mountpoint='%s'}" $labels.instance $labels.mountpoint | query | first | value | humanize1024 }}B

5. Alert States

text

Alert states:

┌─────────────────────────────────────────────┐
│ inactive                                    │
├─────────────────────────────────────────────┤
│ • Alert condition is not met                │
│ • Normal state                              │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ pending                                     │
├─────────────────────────────────────────────┤
│ • Alert condition is met                    │
│ • Waiting out the for duration              │
│ • Not yet sent to Alertmanager              │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ firing                                      │
├─────────────────────────────────────────────┤
│ • Condition held for the whole duration     │
│ • Sent to Alertmanager                      │
│ • Actively alerting                         │
└─────────────────────────────────────────────┘

State transitions:

inactive ──(condition met)──> pending ──(for elapsed)──> firing
    ▲                            │                         │
    └───(condition no longer met)┴─────────────────────────┘
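
The transitions can be checked with a rule unit test run through `promtool test rules`. The sketch below assumes the rule file from section 1.2 is saved as `alerting_rules.yml` next to the test file; the test file name and the job and instance values are hypothetical.

```yaml
# alerting_rules_test.yml (hypothetical file name)
rule_files:
  - alerting_rules.yml          # assumed to contain the AlertName rule from section 1.2

evaluation_interval: 1m

tests:
  - interval: 1m                # spacing between the samples in `values`
    input_series:
      # The target is up at 0m and down from 1m onwards.
      - series: 'up{job="node", instance="host1:9100"}'
        values: '1 0 0 0'
    alert_rule_test:
      # At 1m the condition has only just become true, so the alert is
      # pending. Only firing alerts are compared, hence the empty list.
      - eval_time: 1m
        alertname: AlertName
        exp_alerts: []
      # At 3m the condition has held for longer than `for: 1m`: firing.
      - eval_time: 3m
        alertname: AlertName
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: host1:9100
            exp_annotations:
              summary: "Instance down"
              description: "Instance host1:9100 is down"
```

Under these assumptions the test would be run as `promtool test rules alerting_rules_test.yml`.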

6. Configuring Alerting Rules

6.1 Prometheus Configuration

yaml

# prometheus.yml

global:
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 'alertmanager:9093'

rule_files:
  - '/etc/prometheus/alerts/*.yml'
  - '/etc/prometheus/rules/*.yml'

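6.2 Hot-Reloading Rules

bash

# Hot-reload after editing rule files
# (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# List the loaded rules
curl http://localhost:9090/api/v1/rules

# List active alerts (pending and firing)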
curl http://localhost:9090/api/v1/alerts

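7. Best Practices

7.1 Alert Severity Levels

yaml

# Alert severity levels

severity: critical   # critical, requires immediate action
severity: warning    # warning, needs attention
severity: info       # informational, for reference only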
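
The `severity` label only has an effect once something acts on it. As a rough sketch of how Alertmanager might route on it, assuming a version that supports the `matchers` syntax, the fragment below sends each level to a different receiver. All receiver names are hypothetical, and the receivers are left without notification settings, so this fragment alone sends nothing.

```yaml
# alertmanager.yml (sketch; receiver names are hypothetical)
route:
  receiver: default-receiver            # fallback when no child route matches
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall-pager            # immediate action
    - matchers:
        - severity="warning"
      receiver: team-chat               # needs attention
    - matchers:
        - severity="info"
      receiver: default-receiver        # for reference only

receivers:
  - name: default-receiver
  - name: oncall-pager
  - name: team-chat
```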

7.2 Alert Naming

text

Alert naming conventions:

┌─────────────────────────────────────────────┐
│ Good names                                  │
├─────────────────────────────────────────────┤
│ InstanceDown                                │
│ HighCPUUsage                                │
│ DiskSpaceLow                                │
│ HighErrorRate                               │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ Poor names                                  │
├─────────────────────────────────────────────┤
│ Alert1                                      │
│ CPUAlert                                    │
│ DiskWarning                                 │
└─────────────────────────────────────────────┘

7.3 Alert Content

yaml

# Good alert content
annotations:
  summary: "High CPU usage on {{ $labels.instance }}"
  description: |
    CPU usage on {{ $labels.instance }} is {{ $value | printf "%.2f" }}%.
    This has been ongoing for more than 5 minutes.
    Please investigate the cause.

# Poor alert content
annotations:
  summary: "CPU alert"
  description: "CPU is high"

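8. Summary

Key points of alerting rules:

Field	Description
alert	Alert name
expr	PromQL expression
for	Hold duration
labels	Labels
annotations	Annotations

Alert states:

State	Description
inactive	Condition not met
pending	Condition met, waiting out the for duration
firing	Condition held; sent to Alertmanager

Next, let's learn about Alertmanager!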