Alerting Rules #
1. Overview of Alerting Rules #
1.1 What Is an Alerting Rule #
text
Alerting rule definition:
┌─────────────────────────────────────────────┐
│ Alerting rule                               │
├─────────────────────────────────────────────┤
│ • Condition defined by a PromQL expression  │
│ • Configured in Prometheus                  │
│ • Firing alerts are sent to Alertmanager    │
│ • Supports labels, annotations, and `for`   │
└─────────────────────────────────────────────┘
Alert flow:
┌─────────────┐     ┌─────────────┐     ┌───────────────┐
│ Prometheus  │ ──> │Alertmanager │ ──> │ Notification  │
│ alert rules │     │ processing  │     │ Email/Slack   │
└─────────────┘     └─────────────┘     └───────────────┘
       │                   │                    │
       │ evaluate rules    │ dedupe / group     │ send notifications
       ▼                   ▼                    ▼
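1.2 Rule File Structure #
yaml
# alerting_rules.yml
groups:
  - name: example_group        # rule group name
    interval: 30s              # evaluation interval (optional; defaults to global.evaluation_interval)
    rules:                     # list of rules
      - alert: AlertName       # alert name
        expr: up == 0          # PromQL expression
        for: 1m                # how long the condition must hold before the alert fires
        labels:                # extra labels attached to the alert
          severity: critical
        annotations:           # descriptive annotations
          summary: "Instance down"
          description: "Instance {{ $labels.instance }} is down"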
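2. Alerting Rule Syntax #
2.1 Basic Syntax #
yaml
groups:
  - name: group_name
    interval: 30s              # optional; evaluation interval for this group
    rules:
      - alert: AlertName
        expr: <PromQL expression>
        for: <duration>
        labels:
          <label name>: <label value>
        annotations:
          <annotation name>: <annotation value>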
2.2 Field Reference #
text
Alerting rule fields:
┌─────────────────────────────────────────────┐
│ alert                                       │
├─────────────────────────────────────────────┤
│ • Alert name                                │
│ • Must be a valid label value               │
│ • Uniqueness is not enforced                │
│ • Prefer a meaningful, descriptive name     │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ expr                                        │
├─────────────────────────────────────────────┤
│ • PromQL expression                         │
│ • Each returned series becomes an alert     │
│ • Accepts any valid PromQL expression       │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ for                                         │
├─────────────────────────────────────────────┤
│ • Hold duration                             │
│ • Fires only if expr stays true this long   │
│ • Filters out transient spikes              │
│ • Optional; defaults to firing immediately  │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ labels                                      │
├─────────────────────────────────────────────┤
│ • Extra labels attached to the alert        │
│ • Used for routing and grouping             │
│ • Optional                                  │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ annotations                                 │
├─────────────────────────────────────────────┤
│ • Informational key/value pairs             │
│ • Describe the alert                        │
│ • Support template variables                │
│ • Optional                                  │
└─────────────────────────────────────────────┘
3. Common Alerting Rules #
3.1 Host Monitoring Alerts #
yaml
groups:
  - name: node_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
      - alert: HighCPUUsage
        # rate() replaces irate() here: it averages over the whole window, which is steadier for alerting
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.2f\" }}%"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.2f\" }}%"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has only {{ $value | printf \"%.2f\" }}% free space"
      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has only {{ $value | printf \"%.2f\" }}% free space"
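Note that a filesystem with less than 10% free space satisfies both disk expressions above, so DiskSpaceLow and DiskSpaceCritical fire together. One possible way to suppress the redundant warning is an Alertmanager inhibition rule. The fragment below is only a sketch: it belongs in the Alertmanager configuration rather than in the rule file, and it assumes Alertmanager 0.22 or later for the `source_matchers` / `target_matchers` syntax.

```yaml
# alertmanager.yml (fragment, not part of the Prometheus rule file)
# Mute DiskSpaceLow while DiskSpaceCritical is firing
# for the same instance and mountpoint.
inhibit_rules:
  - source_matchers:
      - alertname="DiskSpaceCritical"
    target_matchers:
      - alertname="DiskSpaceLow"
    equal: ['instance', 'mountpoint']
```

The `equal` list keeps the inhibition scoped to one filesystem, so a critical disk on one host does not hide a warning on another.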
3.2 Application Monitoring Alerts #
yaml
groups:
  - name: application_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | printf \"%.2f\" }}%"
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P99 latency is {{ $value | printf \"%.2f\" }} seconds"
      - alert: LowThroughput
        expr: sum(rate(http_requests_total[5m])) < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low throughput detected"
          description: "Request rate is {{ $value | printf \"%.2f\" }} requests/second"
      - alert: TooManyConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Too many MySQL connections"
          description: "Connection usage is {{ $value | printf \"%.2f\" }}%"
3.3 Service Availability Alerts #
yaml
groups:
  - name: availability_alerts
    rules:
      - alert: ServiceAvailabilityLow
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) * 100 < 99
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Service availability is low"
          description: "Availability is {{ $value | printf \"%.2f\" }}%"
      - alert: ErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (1 - 0.999) * 4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget is burning too fast"
          description: "Error rate is {{ $value | printf \"%.4f\" }}"
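The threshold in ErrorBudgetBurn can be worked out by hand. With a 99.9% availability target, the error budget is the 0.1% of requests that may fail, and the factor 4 is the burn rate the rule tolerates. The 30-day SLO window in the second line is an assumption made for illustration; the rule itself does not state a window.

```latex
\begin{align*}
\text{threshold} &= (1 - 0.999) \times 4 = 0.001 \times 4 = 0.004 \quad (0.4\%\ \text{of requests}) \\
\text{time to exhaust the budget} &= \frac{30\ \text{days}}{4} = 7.5\ \text{days}
\end{align*}
```

In other words, the rule fires once the 1-hour error ratio has stayed above 0.4% for 5 minutes.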
4. Template Variables #
4.1 Common Template Variables #
text
Template variables:
┌─────────────────────────────────────────────┐
│ $labels                                     │
├─────────────────────────────────────────────┤
│ • Labels of the alert                       │
│ • Example: {{ $labels.instance }}           │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ $value                                      │
├─────────────────────────────────────────────┤
│ • Value of the alert expression             │
│ • Example: {{ $value }}                     │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ $externalLabels                             │
├─────────────────────────────────────────────┤
│ • External labels set in Prometheus         │
│ • Example: {{ $externalLabels.cluster }}    │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ $externalURL                                │
├─────────────────────────────────────────────┤
│ • External URL of Prometheus                │
│ • Example: {{ $externalURL }}               │
└─────────────────────────────────────────────┘
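4.2 Template Functions #
yaml
# Number formatting
{{ $value | printf "%.2f" }}
# Human-readable size with binary (1024-based) prefixes
{{ $value | humanize1024 }}
# Human-readable duration (input in seconds)
{{ $value | humanizeDuration }}
# Human-readable percentage (input is a ratio such as 0.25)
{{ $value | humanizePercentage }}
# Human-readable timestamp (input is a Unix timestamp in seconds)
{{ $value | humanizeTimestamp }}
# Title case
{{ $labels.instance | title }}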
4.3 Template Example #
yaml
groups:
  - name: template_examples
    rules:
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          # Variables are not expanded inside string literals, so each query string is built with printf.
          description: |
            Disk {{ $labels.mountpoint }} on {{ $labels.instance }} has only {{ $value | printf "%.2f" }}% free space.
            Total: {{ printf "node_filesystem_size_bytes{instance=%q,mountpoint=%q}" $labels.instance $labels.mountpoint | query | first | value | humanize1024 }}B
            Available: {{ printf "node_filesystem_avail_bytes{instance=%q,mountpoint=%q}" $labels.instance $labels.mountpoint | query | first | value | humanize1024 }}B
5. Alert States #
text
Alert states:
┌─────────────────────────────────────────────┐
│ inactive                                    │
├─────────────────────────────────────────────┤
│ • Alert condition is not met                │
│ • Normal state                              │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ pending                                     │
├─────────────────────────────────────────────┤
│ • Alert condition is met                    │
│ • Waiting for the `for` duration            │
│ • Not yet sent to Alertmanager              │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ firing                                      │
├─────────────────────────────────────────────┤
│ • Alert condition keeps being met           │
│ • Sent to Alertmanager                      │
│ • Alert is active                           │
└─────────────────────────────────────────────┘
State transitions:
inactive ──(condition met)──> pending ──(`for` elapsed)──> firing
   ▲                             │                            │
   └──(condition no longer met)──┴────────────────────────────┘
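The pending-to-firing transition can be checked offline with promtool's rule unit tests. The sketch below assumes the node_alerts group from section 3.1 is saved as node_alerts.yml next to the test file; that file name, the test file name, the job name, and the instance address are all made up for illustration.

```yaml
# node_alerts_test.yml (hypothetical file name)
# Run with: promtool test rules node_alerts_test.yml
rule_files:
  - node_alerts.yml             # assumed to contain the InstanceDown rule (expr: up == 0, for: 1m)
evaluation_interval: 1m
tests:
  - interval: 1m                # spacing of the samples in input_series
    input_series:
      # up is 1 at 0m and 1m, then 0 from 2m onward
      - series: 'up{job="node", instance="host1:9100"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      # 2m: the condition has only just become true, so the alert is pending, not firing
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts: []
      # 4m: the condition has held for longer than `for`, so the alert is firing
      - eval_time: 4m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: host1:9100
            exp_annotations:
              summary: "Instance host1:9100 down"
              description: "host1:9100 of job node has been down for more than 1 minute."
```

Only firing alerts are compared against `exp_alerts`, which is why the pending alert at 2m is expected to match an empty list.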
6. Configuring Alerting Rules #
6.1 Prometheus Configuration #
yaml
# prometheus.yml
global:
  evaluation_interval: 15s     # default rule evaluation interval
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'
rule_files:
  - '/etc/prometheus/alerts/*.yml'
  - '/etc/prometheus/rules/*.yml'
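6.2 Hot-Reloading Rules #
bash
# Validate the rule files before reloading
promtool check rules /etc/prometheus/alerts/*.yml
# Reload after editing rule files (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# List the loaded rules
curl http://localhost:9090/api/v1/rules
# Show active (pending and firing) alerts
curl http://localhost:9090/api/v1/alerts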
7. Best Practices #
7.1 Alert Severity Levels #
yaml
# Alert severity levels
severity: critical   # critical; requires immediate action
severity: warning    # warning; needs attention
severity: info       # informational only
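The severity label only has an effect once something acts on it. As a rough sketch of how the three levels could drive routing, the Alertmanager fragment below sends each severity to a different receiver. The receiver names are hypothetical, their notification settings are omitted, and the `matchers` syntax assumes Alertmanager 0.22 or later.

```yaml
# alertmanager.yml (fragment); receiver names are placeholders
route:
  receiver: default-receiver        # fallback, e.g. for severity: info
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall-pager
    - matchers:
        - severity="warning"
      receiver: team-chat
receivers:
  - name: default-receiver
  - name: oncall-pager
  - name: team-chat
```

Keeping the set of severity values small and consistent across rule files is what makes a routing tree like this manageable.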
7.2 Alert Naming #
text
Alert naming conventions:
┌─────────────────────────────────────────────┐
│ Good names                                  │
├─────────────────────────────────────────────┤
│ InstanceDown                                │
│ HighCPUUsage                                │
│ DiskSpaceLow                                │
│ HighErrorRate                               │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Poor names                                  │
├─────────────────────────────────────────────┤
│ Alert1                                      │
│ CPUAlert                                    │
│ DiskWarning                                 │
└─────────────────────────────────────────────┘
7.3 Alert Content #
yaml
# Good alert content: says where, how much, and for how long
annotations:
  summary: "High CPU usage on {{ $labels.instance }}"
  description: |
    CPU usage on {{ $labels.instance }} is {{ $value | printf "%.2f" }}%.
    This has been ongoing for more than 5 minutes.
    Please investigate the cause.
# Poor alert content: no instance, no value, no context
annotations:
  summary: "CPU alert"
  description: "CPU is high"
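8. Summary #
Key points of alerting rules:
| Field | Description |
|---|---|
| alert | Alert name |
| expr | PromQL expression |
| for | Hold duration |
| labels | Labels |
| annotations | Annotations |
Alert states:
| State | Description |
|---|---|
| inactive | Condition not met |
| pending | Condition met, waiting out the `for` duration |
| firing | Alert active and sent to Alertmanager |
Next, let's learn about Alertmanager!
Last updated: 2026-03-27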