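Grafana Alerting Configuration #

Alerting Overview #

Grafana Alerting lets you define alert rules on data queried from your data sources and sends notifications automatically when a rule's condition is met.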

text

┌─────────────────────────────────────────────────────────────┐
│                    Alerting system architecture             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────┐                                           │
│   │ Data source │  Prometheus / InfluxDB / MySQL            │
│   └──────┬──────┘                                           │
│          │                                                  │
│          ▼                                                  │
│   ┌─────────────┐                                           │
│   │ Alert rule  │  Defines the trigger condition            │
│   └──────┬──────┘                                           │
│          │                                                  │
│          ▼                                                  │
│   ┌─────────────┐                                           │
│   │ Eval engine │  Evaluates rules periodically             │
│   └──────┬──────┘                                           │
│          │                                                  │
│          ▼                                                  │
│   ┌─────────────┐                                           │
│   │ Alert state │  Normal → Pending → Firing → Resolved     │
│   └──────┬──────┘                                           │
│          │                                                  │
│          ▼                                                  │
│   ┌─────────────┐                                           │
│   │  Channels   │  Email, Slack, DingTalk, Webhook, etc.    │
│   └─────────────┘                                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Alert State Transitions #

text

┌─────────────────────────────────────────────────────────────┐
│                    Alert state transitions                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│                    ┌─────────┐                              │
│                    │ Normal  │  Normal state                │
│                    └────┬────┘                              │
│                         │                                   │
│                   Condition met                             │
│                         │                                   │
│                         ▼                                   │
│                    ┌─────────┐                              │
│                    │ Pending │  Waits out the For duration  │
│                    └────┬────┘                              │
│                         │                                   │
│               For duration elapsed                          │
│                         │                                   │
│                         ▼                                   │
│                    ┌─────────┐                              │
│                    │ Firing  │  Fires, sends notification   │
│                    └────┬────┘                              │
│                         │                                   │
│                 Condition clears                            │
│                         │                                   │
│                         ▼                                   │
│                    ┌─────────┐                              │
│                    │ Resolved│  Sends resolved notification │
│                    └────┬────┘                              │
│                         │                                   │
│                         ▼                                   │
│                    ┌─────────┐                              │
│                    │ Normal  │  Back to normal              │
│                    └─────────┘                              │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Alert Rule Configuration #

Creating an Alert Rule #

text

┌─────────────────────────────────────────────────────────────┐
│                    Steps to create an alert rule            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   1. Go to Alerting → Alert rules                           │
│                                                             │
│   2. Click "New alert rule"                                 │
│                                                             │
│   3. Configure the rule:                                    │
│      ├── Rule name                                          │
│      ├── Data source and query                              │
│      ├── Expressions and condition                          │
│      ├── Evaluation behavior                                │
│      └── Folder and group                                   │
│                                                             │
│   4. Save the rule                                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Rule Settings in Detail #

yaml

Alert Rule:
  Name: High CPU Usage
  
  Query:
    Data source: Prometheus
    Query: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    
  Expression:
    Type: Classic condition
    Condition: WHEN last() OF query(A) IS ABOVE 80
    
  Evaluation:
    Evaluate every: 1m
    For: 5m
    
  Annotations:
    Summary: High CPU usage detected on {{ $labels.instance }}
    Description: CPU usage is {{ $value }}%
    Runbook URL: https://wiki.example.com/runbooks/high-cpu
    
  Labels:
    severity: warning
    team: infrastructure

Expression Types #

text

┌─────────────────────────────────────────────────────────────┐
│                    Expression types                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Classic condition:                                         │
│  ├── WHEN last() OF query(A) IS ABOVE 80                    │
│  ├── WHEN avg() OF query(A) IS BELOW 10                     │
│  └── Functions: last(), avg(), min(), max(), sum(), count() │
│                                                             │
│  Math:                                                      │
│  ├── $A > 80                                                │
│  ├── $A + $B > 100                                          │
│  └── Supports arithmetic operators and functions            │
│                                                             │
│  Reduce:                                                    │
│  ├── Reduces a time series to a single value                │
│  ├── Mode: Strict / Replace non-numeric                     │
│  └── Function: Last, Mean, Min, Max, Sum, Count             │
│                                                             │
│  Resample:                                                  │
│  └── Changes the time resolution of the data                │
│                                                             │
│  Threshold:                                                 │
│  ├── $A > 80                                                │
│  └── Supports multiple threshold conditions                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘
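
To make the Reduce and Threshold steps concrete, here is a minimal Python sketch of the idea: collapse a series into one number, then compare it with a threshold. It is an illustrative model only, not Grafana's implementation; the function names `reduce_series` and `is_above` and the `replace_with` parameter are assumptions made for this example.

```python
import math


def reduce_series(values, func="last", mode="strict", replace_with=0.0):
    """Collapse a list of samples into a single number.

    mode="strict":  if any sample is NaN, the result is NaN.
    mode="replace": NaN samples are swapped for `replace_with` first
                    (hypothetical parameter for this sketch).
    """
    if mode == "replace":
        values = [replace_with if math.isnan(v) else v for v in values]
    elif any(math.isnan(v) for v in values):
        return math.nan

    if func == "last":
        return values[-1]
    if func == "mean":
        return sum(values) / len(values)
    if func == "min":
        return min(values)
    if func == "max":
        return max(values)
    raise ValueError(f"unsupported reduce function: {func}")


def is_above(value, threshold):
    """Threshold check; a NaN value never counts as above the threshold."""
    return value > threshold


cpu_samples = [72.0, 79.5, 84.0, 91.0]
print(reduce_series(cpu_samples, "last"))                 # 91.0
print(is_above(reduce_series(cpu_samples, "mean"), 80))   # True (mean is 81.625)
```

The split mirrors the rule editor: the query returns a series, Reduce turns it into one value, and Threshold (or Math) decides whether the condition holds.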

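Evaluation Behavior #

text

┌─────────────────────────────────────────────────────────────┐
│                    Evaluation behavior settings             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Evaluate every (evaluation interval):                      │
│  ├── How often the rule is evaluated                        │
│  ├── Examples: 1m, 5m, 10m                                  │
│  └── Tip: match the data collection (scrape) interval       │
│                                                             │
│  For (duration):                                            │
│  ├── How long the condition must hold before firing         │
│  ├── Examples: 0s, 1m, 5m, 10m                              │
│  └── Purpose: keeps brief spikes from firing alerts         │
│                                                             │
│  Example:                                                   │
│  ├── Evaluate every: 1m                                     │
│  ├── For: 5m                                                │
│  └── Meaning: fires once the condition has held for 5m      │
│      (about 5 consecutive evaluations at a 1m interval)     │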
│                                                             │
└─────────────────────────────────────────────────────────────┘
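
The interaction between the evaluation interval and the For duration can be sketched as a small simulation. This is a simplified model written for illustration, assuming the rule enters Pending on the first breaching evaluation and fires once the breach has lasted at least the For duration; the function name `simulate` and its parameters are hypothetical.

```python
def simulate(samples, threshold, interval_s=60, for_s=300):
    """Return the alert state after each evaluation.

    samples:    one value per evaluation, in order
    interval_s: seconds between evaluations ("Evaluate every")
    for_s:      seconds the condition must hold before firing ("For")
    """
    states = []
    pending_since = None  # time of the first breaching evaluation
    for i, value in enumerate(samples):
        now = i * interval_s
        if value > threshold:
            if pending_since is None:
                pending_since = now
            if now - pending_since >= for_s:
                states.append("Firing")
            else:
                states.append("Pending")
        else:
            # Condition cleared: the pending timer starts over.
            pending_since = None
            states.append("Normal")
    return states


# Sustained breach: Pending for five evaluations, then Firing.
print(simulate([90, 91, 92, 93, 94, 95, 96], threshold=80))
# A brief spike never reaches Firing.
print(simulate([90, 90, 50, 90], threshold=80))
```

In this model a dip below the threshold resets the timer, which is why short spikes stay in Pending and never send a notification.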

Alert Labels and Annotations #

yaml

Labels:
  severity: critical | warning | info
  team: infrastructure | application | database
  environment: production | staging | development
  service: api | web | worker
  grafana_folder: folder that contains the rule (reserved label, set by Grafana)

Annotations:
  summary: short description
  description: detailed description
  runbook_url: link to the runbook
  dashboard_url: link to a related dashboard

Using Template Variables #

text

┌─────────────────────────────────────────────────────────────┐
│                    Template variables                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Label values:                                              │
│  {{ $labels.instance }}    Instance name                    │
│  {{ $labels.job }}         Job name                         │
│  {{ $labels.service }}     Service name                     │
│                                                             │
│  Metric values:                                             │
│  {{ $value }}              Evaluated value(s), as a string  │
│  {{ $values.A }}           Result of query A                │
│  {{ $values.A.Value }}     Numeric value of query A         │
│  {{ $values.A.Labels }}    Labels of query A                │
│                                                             │
│  Time (notification templates):                             │
│  {{ .Time }}               Alert time                       │
│  {{ .StartsAt }}           Start time                       │
│  {{ .EndsAt }}             End time                         │
│                                                             │
│  Example:                                                   │
│  Summary: High CPU on {{ $labels.instance }}                │
│  Description: CPU is at {{ $value }}%                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Notification Channel Configuration #

Contact Points #

text

┌─────────────────────────────────────────────────────────────┐
│                    Contact point types                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Instant messaging:                                         │
│  ├── Slack         Slack channel                            │
│  ├── Discord       Discord channel                          │
│  ├── Telegram      Telegram group                           │
│  ├── DingTalk      DingTalk bot                             │
│  ├── WeCom         WeCom (WeChat Work)                      │
│  └── Feishu        Feishu bot                               │
│                                                             │
│  Email:                                                     │
│  └── Email         Email notifications                      │
│                                                             │
│  Automation:                                                │
│  ├── Webhook       HTTP callback                            │
│  ├── PagerDuty     Incident management platform             │
│  ├── Opsgenie      Incident management platform             │
│  ├── VictorOps     Incident management platform             │
│  └── Kafka         Kafka message queue                      │
│                                                             │
│  Other:                                                     │
│  └── Pushover      Mobile push notifications                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Configuring Email Notifications #

text

┌─────────────────────────────────────────────────────────────┐
│                    Email configuration                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Contact point:                                             │
│  ├── Name: Email Team                                       │
│  ├── Type: Email                                            │
│  └── Addresses: team@example.com, oncall@example.com        │
│                                                             │
│  SMTP settings (grafana.ini):                               │
│  [smtp]                                                     │
│  enabled = true                                             │
│  host = smtp.example.com:587                                │
│  user = grafana@example.com                                 │
│  password = ********                                        │
│  from_address = grafana@example.com                         │
│  from_name = Grafana Alerting                               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Configuring Slack Notifications #

yaml

Contact Point:
  Name: Slack Team
  Type: Slack
  Integration settings:
    Webhook URL: https://hooks.slack.com/services/xxx/xxx/xxx
    Channel: "#alerts"
    Username: Grafana
    Icon URL: https://grafana.com/img/fav32.png
  Optional settings:
    Mention channel: here
    Mention groups: @oncall
    Token: xoxb-xxx-xxx-xxx

Configuring DingTalk Notifications #

yaml

Contact Point:
  Name: DingTalk Team
  Type: DingTalk
  Integration settings:
    Webhook URL: https://oapi.dingtalk.com/robot/send?access_token=xxx
    Secret: SECxxx  # signing secret
    Message Type: Link / ActionCard
  Message content:
    Title: {{ .CommonLabels.alertname }}
    Content: |
      Alert name: {{ .CommonLabels.alertname }}
      Severity: {{ .CommonLabels.severity }}
      Details: {{ .CommonAnnotations.description }}
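
When a DingTalk robot has a signing secret, requests to its webhook carry a `timestamp` and a `sign` query parameter. The sketch below shows how such a signed URL can be built, for example to test the webhook by hand. It assumes the signing scheme described in DingTalk's custom robot documentation (HMAC-SHA256 over the timestamp, a newline, and the secret, then Base64 and URL encoding); the function name `sign_dingtalk_url` is hypothetical, and the token and secret are the placeholders from the configuration above.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def sign_dingtalk_url(webhook_url, secret, timestamp_ms=None):
    """Append timestamp and sign parameters to a DingTalk webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # current time in milliseconds
    # The string to sign is "<timestamp>\n<secret>", keyed with the secret.
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    sign = urllib.parse.quote_plus(base64.b64encode(digest))
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"


signed = sign_dingtalk_url(
    "https://oapi.dingtalk.com/robot/send?access_token=xxx",  # placeholder token
    "SECxxx",                                                 # placeholder secret
)
print(signed)
```

The signature depends on the timestamp, so a signed URL should be generated right before each request rather than stored.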

Configuring Webhook Notifications #

yaml

Contact Point:
  Name: Custom Webhook
  Type: Webhook
  Integration settings:
    URL: https://api.example.com/alerts
    HTTP Method: POST
    Authorization Header: Bearer xxx
  Message content:
    Body: |
      {
        "alertname": "{{ .CommonLabels.alertname }}",
        "severity": "{{ .CommonLabels.severity }}",
        "instance": "{{ .CommonLabels.instance }}",
        "value": "{{ (index .Alerts 0).ValueString }}",
        "summary": "{{ .CommonAnnotations.summary }}"
      }
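
On the receiving side, a service behind `https://api.example.com/alerts` would need to parse this body. The following is a minimal sketch of that step, assuming the payload has exactly the five fields of the custom body above; the names `parse_alert` and `format_line` are hypothetical, and HTTP handling and authentication are left out.

```python
import json

# Fields defined by the custom webhook body in this section.
REQUIRED_FIELDS = ("alertname", "severity", "instance", "value", "summary")


def parse_alert(body):
    """Parse the JSON body and check that every expected field is present."""
    payload = json.loads(body)
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return {name: str(payload[name]) for name in REQUIRED_FIELDS}


def format_line(alert):
    """Render a parsed alert as a one-line log message."""
    return (
        f"[{alert['severity'].upper()}] {alert['alertname']} "
        f"on {alert['instance']}: {alert['summary']}"
    )


sample_body = json.dumps({
    "alertname": "High CPU Usage",
    "severity": "warning",
    "instance": "server-01",
    "value": "87.5",
    "summary": "High CPU usage on server-01",
})
print(format_line(parse_alert(sample_body)))
# [WARNING] High CPU Usage on server-01: High CPU usage on server-01
```

Rejecting payloads with missing fields early makes template mistakes in the contact point visible on the receiver instead of silently producing empty alerts.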

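Notification Policies #

Notification Policy Configuration #

text

┌─────────────────────────────────────────────────────────────┐
│                    Notification policy structure            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Root policy:                                               │
│  ├── Matches all alerts by default                          │
│  └── Sets the default contact point and grouping            │
│                                                             │
│  Child policies:                                            │
│  ├── Match specific alerts by label                         │
│  ├── Override settings inherited from the parent            │
│  └── Can be nested several levels deep                      │
│                                                             │
│  Matching order:                                            │
│  ├── Start at the root policy                               │
│  ├── Check child policies in order                          │
│  ├── Use the first matching policy                          │
│  └── Later siblings need "Continue matching" enabled        │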
│                                                             │
└─────────────────────────────────────────────────────────────┘

Policy Configuration Example #

yaml

Notification Policy:
  Root:
    Group by: ['alertname', 'instance']
    Group wait: 30s
    Group interval: 5m
    Repeat interval: 12h
    Contact point: Default Team
    
  Child Policies:
    - Matcher:
        severity: critical
      Contact point: Critical Team
      Group wait: 10s
      Group interval: 1m
      Repeat interval: 1h
      
    - Matcher:
        team: database
      Contact point: DBA Team
      Group by: ['alertname']
      
    - Matcher:
        severity: warning
        team: application
      Contact point: App Team
      Repeat interval: 4h

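Grouping Settings #

text

┌─────────────────────────────────────────────────────────────┐
│                    Grouping settings                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Group by:                                                  │
│  ├── Groups alerts by the listed labels                     │
│  ├── Alerts with the same label values share a group        │
│  └── Example: ['alertname', 'instance']                     │
│                                                             │
│  Group wait:                                                │
│  ├── How long to wait to collect alerts for a new group     │
│  ├── The collected alerts go out in one notification        │
│  └── Example: 30s                                           │
│                                                             │
│  Group interval:                                            │
│  ├── Wait before notifying about changes to a group         │
│  │   that has already been notified                         │
│  └── Example: 5m                                            │
│                                                             │
│  Repeat interval:                                           │
│  ├── How often to resend while an alert keeps firing        │
│  └── Example: 12h                                           │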
│                                                             │
└─────────────────────────────────────────────────────────────┘
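
The effect of Group by can be illustrated with a short sketch that buckets alerts by the values of the chosen labels. It models only the grouping key, not the timing settings; `group_alerts` is a hypothetical helper and the sample alerts are made up for this example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (label dicts) by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts with identical values for every group_by label share a key.
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "High CPU Usage", "instance": "server-01", "severity": "warning"},
    {"alertname": "High CPU Usage", "instance": "server-01", "severity": "critical"},
    {"alertname": "High CPU Usage", "instance": "server-02", "severity": "warning"},
]

by_name_and_instance = group_alerts(alerts, ["alertname", "instance"])
by_name = group_alerts(alerts, ["alertname"])
print(len(by_name_and_instance))  # 2 groups: server-01 and server-02
print(len(by_name))               # 1 group: all three alerts together
```

Fewer group-by labels mean fewer, larger notifications; more labels mean more, smaller ones.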

Alert Silences #

Creating a Silence #

text

┌─────────────────────────────────────────────────────────────┐
│                    Silence configuration                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Steps:                                                     │
│  1. Go to Alerting → Silences                               │
│  2. Click "New silence"                                     │
│  3. Set the matchers and duration                           │
│  4. Save                                                    │
│                                                             │
│  Example:                                                   │
│  ├── Matchers: instance = server-01                         │
│  ├── Start: 2024-01-15 10:00                                │
│  ├── Duration: 2h                                           │
│  ├── Reason: Scheduled maintenance                          │
│  └── Created by: admin                                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Silence Matchers #

text

┌─────────────────────────────────────────────────────────────┐
│                    Matcher syntax                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Exact match:                                               │
│  instance = server-01                                       │
│                                                             │
│  Regex match:                                               │
│  instance =~ server-.*                                      │
│                                                             │
│  Negative match:                                            │
│  instance != server-01                                      │
│  instance !~ server-.*                                      │
│                                                             │
│  Multiple matchers (all must match):                        │
│  instance = server-01, severity = critical                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
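
The four matcher operators can be modeled with a small evaluator. This sketch assumes that regex matchers must match the whole label value and that a missing label counts as an empty string; it uses Python's `re` module, whose syntax differs in places from the regex engine Grafana uses. The `matches` function and the tuple format are assumptions made for this example.

```python
import re


def matches(labels, matchers):
    """Return True if the alert labels satisfy every matcher.

    matchers is a list of (label_name, operator, value) tuples,
    where operator is one of "=", "!=", "=~", "!~".
    """
    for name, op, value in matchers:
        actual = labels.get(name, "")
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False  # matchers are combined with AND
    return True


alert = {"instance": "server-01", "severity": "critical"}
print(matches(alert, [("instance", "=", "server-01"), ("severity", "=", "critical")]))  # True
print(matches(alert, [("instance", "=~", "server-.*")]))                                # True
print(matches(alert, [("instance", "!~", "server-.*")]))                                # False
```

A silence built from `instance = server-01, severity = critical` therefore leaves warning alerts from the same server untouched.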

Alert Rule Examples #

System Monitoring Alerts #

yaml

High CPU Usage:
  Query: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
  Condition: WHEN last() OF query(A) IS ABOVE 80
  For: 5m
  Labels:
    severity: warning
    team: infrastructure
  Annotations:
    summary: High CPU usage on {{ $labels.instance }}
    description: CPU usage is {{ $value }}%

High Memory Usage:
  Query: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
  Condition: WHEN last() OF query(A) IS ABOVE 90
  For: 5m
  Labels:
    severity: critical
    team: infrastructure
  Annotations:
    summary: High memory usage on {{ $labels.instance }}
    description: Memory usage is {{ $value }}%

Disk Space Low:
  Query: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100
  Condition: WHEN last() OF query(A) IS ABOVE 85
  For: 5m
  Labels:
    severity: warning
    team: infrastructure
  Annotations:
    summary: Disk space low on {{ $labels.instance }}
    description: Disk {{ $labels.mountpoint }} usage is {{ $value }}%

Application Monitoring Alerts #

yaml

High Error Rate:
  Query: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
  Condition: WHEN last() OF query(A) IS ABOVE 5
  For: 2m
  Labels:
    severity: critical
    team: application
  Annotations:
    summary: High error rate detected
    description: Error rate is {{ $value }}%

High Latency:
  Query: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
  Condition: WHEN last() OF query(A) IS ABOVE 1
  For: 5m
  Labels:
    severity: warning
    team: application
  Annotations:
    summary: High latency detected
    description: P95 latency is {{ $value }}s

Service Down:
  Query: up{job="my-service"}
  Condition: WHEN last() OF query(A) IS BELOW 1
  For: 1m
  Labels:
    severity: critical
    team: application
  Annotations:
    summary: Service {{ $labels.job }} is down
    description: Instance {{ $labels.instance }} is not responding

Database Monitoring Alerts #

yaml

MySQL Down:
  Query: mysql_up
  Condition: WHEN last() OF query(A) IS BELOW 1
  For: 1m
  Labels:
    severity: critical
    team: database
  Annotations:
    summary: MySQL is down
    description: MySQL instance {{ $labels.instance }} is not responding

MySQL Too Many Connections:
  Query: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100
  Condition: WHEN last() OF query(A) IS ABOVE 80
  For: 2m
  Labels:
    severity: warning
    team: database
  Annotations:
    summary: MySQL connection usage high
    description: Connection usage is {{ $value }}%

Redis Down:
  Query: redis_up
  Condition: WHEN last() OF query(A) IS BELOW 1
  For: 1m
  Labels:
    severity: critical
    team: database
  Annotations:
    summary: Redis is down
    description: Redis instance {{ $labels.instance }} is not responding

Alert Management #

Viewing Alert State #

text

┌─────────────────────────────────────────────────────────────┐
│                    Alert status views                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Alerting → Alert rules:                                    │
│  ├── View all alert rules                                   │
│  ├── Check rule state                                       │
│  └── Edit or delete rules                                   │
│                                                             │
│  Alerting → Alert groups:                                   │
│  ├── View alerts by group                                   │
│  ├── View alert details                                     │
│  └── Act on alerts (for example, silence them)              │
│                                                             │
│  Dashboard panel:                                           │
│  ├── Panels show the alert state                            │
│  └── Click to view alert details                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Alert Actions #

text

┌─────────────────────────────────────────────────────────────┐
│                    Alert actions                            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Silence:                                                   │
│  ├── Temporarily stops notifications                        │
│  ├── Set a duration                                         │
│  └── Add a reason for the silence                           │
│                                                             │
│  Pause:                                                     │
│  ├── Pauses evaluation of the alert rule                    │
│  └── No new alerts fire while paused                        │
│                                                             │
│  Test:                                                      │
│  ├── Test the alert rule                                    │
│  └── Send a test notification                               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

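Best Practices #

Alert Design Principles #

text

┌─────────────────────────────────────────────────────────────┐
│                    Alert design principles                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Actionable                                              │
│     └── Every alert should have a clear response            │
│                                                             │
│  2. Avoid alert fatigue                                     │
│     ├── Set sensible thresholds                             │
│     ├── Use the For duration to filter brief spikes         │
│     └── Group and deduplicate                               │
│                                                             │
│  3. Clear descriptions                                      │
│     ├── Keep alert names short and clear                    │
│     ├── Put the key information in the description          │
│     └── Link to the runbook                                 │
│                                                             │
│  4. Sensible severity levels                                │
│     ├── Critical: needs immediate action                    │
│     ├── Warning: needs attention                            │
│     └── Info: for reference only                            │
│                                                             │
│  5. Layered alerts                                          │
│     ├── Infrastructure layer                                │
│     ├── Application layer                                   │
│     └── Business layer                                      │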
│                                                             │
└─────────────────────────────────────────────────────────────┘

Threshold Recommendations #

text

┌─────────────────────────────────────────────────────────────┐
│                    Threshold reference                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  CPU usage:                                                 │
│  ├── Warning: > 80%                                         │
│  └── Critical: > 95%                                        │
│                                                             │
│  Memory usage:                                              │
│  ├── Warning: > 85%                                         │
│  └── Critical: > 95%                                        │
│                                                             │
│  Disk usage:                                                │
│  ├── Warning: > 85%                                         │
│  └── Critical: > 95%                                        │
│                                                             │
│  Error rate:                                                │
│  ├── Warning: > 1%                                          │
│  └── Critical: > 5%                                         │
│                                                             │
│  Response time (P95):                                       │
│  ├── Warning: > 500ms                                       │
│  └── Critical: > 2s                                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
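
The reference thresholds above can be written down as a lookup table with a two-level check. This is a sketch for reasoning about the values, not something Grafana runs; the metric keys, the `THRESHOLDS` table, and the `classify` function are names invented for this example, and latency is expressed in seconds.

```python
# (warning, critical) thresholds taken from the reference table above.
THRESHOLDS = {
    "cpu_percent": (80, 95),
    "memory_percent": (85, 95),
    "disk_percent": (85, 95),
    "error_rate_percent": (1, 5),
    "p95_latency_seconds": (0.5, 2),
}


def classify(metric, value):
    """Map a metric value to "critical", "warning", or "ok"."""
    warning, critical = THRESHOLDS[metric]
    # Check the stricter level first so a critical value is not
    # reported as a mere warning.
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


print(classify("cpu_percent", 87))            # warning
print(classify("error_rate_percent", 6.2))    # critical
print(classify("p95_latency_seconds", 0.3))   # ok
```

In Grafana the two levels would normally be two alert rules with different `severity` labels, so that notification policies can route them separately.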

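Notification Policy Recommendations #

text

┌─────────────────────────────────────────────────────────────┐
│                    Notification policy tips                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Grouping:                                                  │
│  ├── Group by service or application                        │
│  └── Avoid sending many separate notifications              │
│                                                             │
│  Frequency:                                                 │
│  ├── Critical: notify immediately, repeat every 1h          │
│  ├── Warning: notify after 5m, repeat every 4h              │
│  └── Info: send in batches                                  │
│                                                             │
│  Channels:                                                  │
│  ├── Critical: phone/SMS + IM                               │
│  ├── Warning: IM + email                                    │
│  └── Info: email                                            │
│                                                             │
│  Escalation:                                                │
│  ├── Escalate automatically if unhandled for too long       │
│  └── Notify more senior staff                               │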
│                                                             │
└─────────────────────────────────────────────────────────────┘

Next Steps #

With alerting configured, move on to the advanced topics to learn about provisioning, plugin development, and other advanced features.