Monitoring and Alerting
1. Monitoring Overview
1.1 Monitoring Architecture
```text
Monitoring architecture

CockroachDB cluster
├── Node 1 (/_status/vars)
├── Node 2 (/_status/vars)
└── Node 3 (/_status/vars)
        │
        ▼
Prometheus
├── Scrapes metrics
├── Stores time-series data
└── Evaluates alerting rules
        │
        ▼
Grafana
├── Visualization dashboards
└── Alert notifications
```
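To make the first hop of this pipeline concrete, here is a minimal sketch of reading the Prometheus text format that each node serves at `/_status/vars`. The sample payload and its values are made up for illustration, and `parse_metrics` is a hypothetical helper, not part of CockroachDB or Prometheus. In practice the text would come from an HTTP GET against a node, for example `http://node1:8080/_status/vars`.

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        m = SAMPLE_RE.match(line)
        if m is None:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical excerpt of a node's response (metric values are invented).
SAMPLE = """\
# HELP sql_conns Number of open SQL connections
# TYPE sql_conns gauge
sql_conns{node_id="1"} 12
capacity{store="1"} 1e+11
capacity_available{store="1"} 2.5e+10
"""

samples = parse_metrics(SAMPLE)
by_name = {name: value for name, _, value in samples}
disk_used_fraction = 1 - by_name["capacity_available"] / by_name["capacity"]
print(f"open connections: {by_name['sql_conns']:.0f}, disk used: {disk_used_fraction:.0%}")
```

Prometheus does this parsing itself on every scrape; the sketch is only meant to show what the scraped data looks like before it becomes a time series.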
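1.2 Monitoring Metrics
| Metric category | Description |
|---|---|
| System metrics | CPU, memory, disk, network |
| Cluster metrics | Node status, range distribution |
| SQL metrics | Query latency, throughput |
| Storage metrics | Write throughput, compaction activity |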
2. Prometheus Integration
2.1 Configuring Prometheus
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'cockroachdb'
    metrics_path: '/_status/vars'
    # For a secure cluster, also set `scheme: https` and a matching `tls_config`.
    static_configs:
      - targets:
          - 'node1:8080'
          - 'node2:8080'
          - 'node3:8080'
    relabel_configs:
      # Strip the port so that `instance` holds only the host name.
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'

rule_files:
  - 'cockroachdb_alerts.yml'
```
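The relabel rule in this configuration is easy to misread, so here is a small sketch of what it does to each target address. It assumes Prometheus's documented behavior of anchoring relabel regexes at both ends, which `re.fullmatch` mimics. `relabel_instance` is a hypothetical helper, and Python writes the capture group as `\1` where Prometheus uses `${1}`.

```python
import re


def relabel_instance(address, regex=r"([^:]+):\d+", replacement=r"\1"):
    """Return the new `instance` value, or None when the regex does not match.

    Prometheus anchors relabel regexes at both ends, hence fullmatch().
    """
    m = re.fullmatch(regex, address)
    if m is None:
        return None  # with no match, the relabel step leaves the label untouched
    return m.expand(replacement)


for target in ["node1:8080", "node2:8080", "node3:8080"]:
    print(target, "->", relabel_instance(target))
```

The effect is that dashboards and alert descriptions show `node1` rather than `node1:8080`.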
2.2 Alerting Rules
```yaml
# cockroachdb_alerts.yml
# Metric names follow the Prometheus names CockroachDB exposes at /_status/vars;
# confirm them against your version's output. Latency histograms are in nanoseconds.
groups:
  - name: cockroachdb
    rules:
      - alert: CockroachDBNodeDown
        expr: up{job="cockroachdb"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CockroachDB node down"
          description: "Node {{ $labels.instance }} has been down for more than 1 minute."

      - alert: CockroachDBHighLatency
        # 1e9 ns = 1 s
        expr: histogram_quantile(0.99, rate(sql_service_latency_bucket[5m])) > 1e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High SQL latency"
          description: "99th percentile latency is above 1 second."

      - alert: CockroachDBLowReplicas
        # count(replicas) would only count the stores reporting the metric.
        expr: sum(ranges_underreplicated) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Under-replicated ranges"
          description: "Some ranges have fewer replicas than their replication factor requires."

      - alert: CockroachDBHighDiskUsage
        expr: (1 - capacity_available / capacity) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage"
          description: "Disk usage is above 80%."

      - alert: CockroachDBSlowQueries
        # 5e8 ns = 0.5 s
        expr: rate(sql_service_latency_count[5m]) > 100 and histogram_quantile(0.95, rate(sql_service_latency_bucket[5m])) > 5e8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries detected"
          description: "Query rate is above 100/s while 95th percentile latency exceeds 0.5 seconds."
```
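The latency rules rely on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw samples. The sketch below is a simplified re-implementation for illustration only, not Prometheus's code: it interpolates linearly inside the bucket containing the target rank and assumes the first bucket starts at 0. The bucket bounds are in nanoseconds and the counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # rank lies in +Inf: report highest finite bound
            if count == lower_count:
                return upper_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts: 900 queries took <= 0.1 s, 980 <= 0.5 s, ...
BUCKETS = [(1e8, 900), (5e8, 980), (1e9, 995), (math.inf, 1000)]

p99 = histogram_quantile(0.99, BUCKETS)
p95 = histogram_quantile(0.95, BUCKETS)
print(f"p99 = {p99 / 1e9:.3f} s, p95 = {p95 / 1e9:.3f} s")
```

Because the result is interpolated, its accuracy depends on how finely the buckets cover the range around the threshold; a value that lands in the `+Inf` bucket is capped at the highest finite bound.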
3. Grafana Dashboards
3.1 Importing a Dashboard
```text
Import the official dashboard:
1. Open Grafana
2. Navigate to Dashboards > Import
3. Enter the dashboard ID: 10908 (CockroachDB Official)
4. Select the Prometheus data source
5. Click Import

Official dashboards:
- CockroachDB Overview: 10908
- CockroachDB Runtime: 10909
- CockroachDB SQL: 10910
- CockroachDB Storage: 10911
```
3.2 Custom Dashboard
```json
{
  "dashboard": {
    "title": "CockroachDB Custom Dashboard",
    "panels": [
      {
        "title": "QPS",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(sql_service_latency_count[1m])",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Latency (p99, ns)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(sql_service_latency_bucket[5m]))",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Replica Count",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(replicas)",
            "legendFormat": "Total"
          }
        ]
      },
      {
        "title": "Disk Usage",
        "type": "gauge",
        "targets": [
          {
            "expr": "(1 - capacity_available / capacity) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      }
    ]
  }
}
```
4. Key Metrics
4.1 SQL Metrics
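```text
# Key SQL metrics (names as exposed at /_status/vars; verify on your version)
# Query throughput (QPS): take rate() of the histogram's sample count
sql_service_latency_count
# Query latency (histogram, nanoseconds)
sql_service_latency_bucket
sql_service_latency_sum
# Open SQL connections
sql_conns
# Transaction statistics
txn_commits
txn_aborts
txn_restarts
```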
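Throughput is not stored directly: it is derived from a counter that only ever increases. The sketch below shows the idea behind taking a rate over such a counter, including the reset that occurs when a node restarts. `simple_rate` is a hypothetical helper and the sample values are invented. Unlike PromQL's `rate()`, it does not extrapolate to the edges of the window.

```python
def simple_rate(samples):
    """Per-second increase of a counter from (timestamp_s, value) samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the value after the reset counts as new increase.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical scrapes every 15 seconds.
steady = [(0, 1000), (15, 1600), (30, 2200), (45, 2800), (60, 3400)]
with_restart = [(0, 1000), (15, 1600), (30, 100), (45, 700)]

print("steady QPS:", simple_rate(steady))
print("QPS across a restart:", simple_rate(with_restart))
```

This is why dashboards plot `rate(...)` of a counter instead of the counter itself: the raw value only shows the total since the process started.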
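4.2 Storage Metrics
```text
# Key storage metrics (names as exposed at /_status/vars; verify on your version)
# Store capacity
capacity
capacity_used
capacity_available
# Disk write throughput
sys_host_disk_write_bytes
# Disk read throughput
sys_host_disk_read_bytes
# Compaction activity
rocksdb_compactions
```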
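4.3 Cluster Metrics
```text
# Key cluster metrics (names as exposed at /_status/vars; verify on your version)
# Node reachability as seen by the Prometheus scrape
up
# Replica and range counts per store
replicas
ranges
# Raft leaders
replicas_leaders
# Replica state
replicas_quiescent
replicas_leaseholders
# Replication health
ranges_underreplicated
ranges_unavailable
```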
5. Web UI Monitoring
5.1 Built-in Monitoring
```text
Open the Web UI:
http://localhost:8080

Main pages:
├── Overview: cluster overview
├── Metrics: detailed metrics
├── Databases: database management
├── SQL Activity: SQL activity
├── Jobs: background jobs
├── Advanced Debug: advanced debugging
└── Settings: settings
```
5.2 Monitoring SQL Activity
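```sql
-- Top statements by execution count.
-- Column names follow crdb_internal.node_statement_statistics in recent versions,
-- where latencies are averages in seconds; check the table's columns on your version.
SELECT
  "key" AS query,
  count,
  service_lat_avg
FROM crdb_internal.node_statement_statistics
ORDER BY count DESC
LIMIT 10;

-- List active transactions across the cluster
SELECT * FROM crdb_internal.cluster_transactions;

-- List slow statements
SELECT
  "key" AS query,
  service_lat_avg * 1000 AS latency_ms
FROM crdb_internal.node_statement_statistics
WHERE service_lat_avg > 1 -- average latency above 1 second
ORDER BY service_lat_avg DESC;
```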
6. Alert Configuration
6.1 Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'  # placeholder; keep real credentials out of version control

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'team-email'  # default when no child route matches
  routes:
    - match:
        severity: critical
      receiver: 'team-pagerduty'
    - match:
        severity: warning
      receiver: 'team-email'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
  - name: 'team-pagerduty'
    pagerduty_configs:
      - service_key: 'your-service-key'
        severity: critical
```
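The routing tree above decides which receiver gets each alert. As a rough model of that decision, the sketch below walks the child routes in order and returns the first one whose `match` labels all equal the alert's labels, falling back to the parent's receiver otherwise. It is a simplification for illustration: `pick_receiver` is a hypothetical helper, and it ignores Alertmanager features such as `continue`, regex matchers and grouping.

```python
ROUTE = {
    "receiver": "team-email",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "team-pagerduty"},
        {"match": {"severity": "warning"}, "receiver": "team-email"},
    ],
}


def pick_receiver(route, labels):
    """Return the receiver of the first matching child route, else the parent's."""
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(name) == value for name, value in wanted.items()):
            return pick_receiver(child, labels)  # descend into the matching child
    return route["receiver"]


for severity in ["critical", "warning", "info"]:
    print(severity, "->", pick_receiver(ROUTE, {"severity": severity}))
```

In this model an alert without a `severity` label, or with an unlisted one such as `info`, ends up at the default `team-email` receiver.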
6.2 Alerting Rule Examples
```yaml
# Metric names follow the Prometheus names CockroachDB exposes at /_status/vars;
# confirm them against your version's output.
groups:
  - name: cockroachdb_critical
    rules:
      - alert: CockroachDBClusterDown
        # sum() still yields 0 when every node is down, whereas
        # count(up == 1) returns no data and the alert would never fire.
        expr: sum(up{job="cockroachdb"}) < 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CockroachDB cluster critical"
          description: "Fewer than 2 nodes are up. Cluster may be unavailable."

      - alert: CockroachDBReplicaUnavailability
        expr: sum(ranges_underreplicated) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Replica unavailability"
          description: "Some ranges have fewer replicas than their replication factor requires."

  - name: cockroachdb_warning
    rules:
      - alert: CockroachDBHighMemoryUsage
        # sys_rss: resident set size in bytes, as reported by CockroachDB
        expr: sys_rss / (1024 * 1024 * 1024) > 16
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is above 16 GiB."

      - alert: CockroachDBHeartbeatFailures
        expr: rate(liveness_heartbeatfailures[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Liveness heartbeat failures"
          description: "Node liveness heartbeats are failing."
```
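The "fewer than 2 nodes" threshold in the cluster-down rule comes from majority quorum. Assuming a three-node cluster with the default replication factor of 3, a range stays available only while a majority of its replicas is reachable. The arithmetic can be sketched as follows; both function names are hypothetical.

```python
def quorum(replicas):
    """Smallest majority among `replicas` voting replicas."""
    return replicas // 2 + 1


def tolerable_failures(replicas):
    """How many replicas can be lost while a majority remains."""
    return replicas - quorum(replicas)


for n in (3, 5):
    print(f"{n} replicas: quorum {quorum(n)}, tolerates {tolerable_failures(n)} failure(s)")
```

For a cluster with more nodes or a higher replication factor, the threshold in the rule would need to be adjusted accordingly.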
7. Log Management
7.1 Log Configuration
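```bash
# Logging flags for `cockroach start`.
# These command-line flags are deprecated since v21.1 in favor of the YAML
# logging configuration (--log or --log-config-file).
--log-dir=/var/log/cockroach
--log-file-max-size=100MiB
# Total size kept per log group before the oldest files are deleted
--log-group-max-size=1GiB
```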
7.2 Log Analysis
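```bash
# Collect logs and diagnostics from the cluster into an archive.
# This connects to the SQL/RPC port (26257 by default), not the HTTP port;
# add --insecure or --certs-dir to match your cluster.
cockroach debug zip debug.zip --host=localhost:26257
# Search for slow-query entries
grep -i "slow" /var/log/cockroach/cockroach.log
# Search for error-severity entries (lines begin with E followed by the date)
grep -E '^E[0-9]{6} ' /var/log/cockroach/cockroach.log
```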
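For a quick overview beyond `grep`, log lines can be counted by severity. The sketch below assumes the line prefix used by CockroachDB's text log format: a severity letter (`I`, `W`, `E` or `F`), a `yymmdd` date and a time. The sample lines are invented and `count_by_severity` is a hypothetical helper; lines that do not match the prefix, such as continuation lines, are skipped.

```python
import re
from collections import Counter

# Severity letter, yymmdd date, then a time, e.g. "E210323 13:25:06.300000 ...".
PREFIX_RE = re.compile(r"^([IWEF])(\d{6}) (\d{2}:\d{2}:\d{2}\.\d+)")
SEVERITY = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def count_by_severity(lines):
    """Count log entries per severity, ignoring lines without the prefix."""
    counts = Counter()
    for line in lines:
        m = PREFIX_RE.match(line)
        if m:
            counts[SEVERITY[m.group(1)]] += 1
    return counts


# Invented sample lines in the assumed format.
SAMPLE_LINES = [
    "I210323 13:25:04.100000 1 server/node.go:100  [n1] node started",
    "W210323 13:25:05.200000 45 kv/dist_sender.go:200  [n1] slow range RPC",
    "E210323 13:25:06.300000 45 sql/conn.go:300  [n1] connection reset",
    "    continuation line without a prefix",
]

counts = count_by_severity(SAMPLE_LINES)
print(dict(counts))
```

To run it against a real file, pass an open file object, for example `count_by_severity(open("/var/log/cockroach/cockroach.log"))`.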
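8. Summary
Key components for monitoring and alerting:
| Component | Description |
|---|---|
| Prometheus | Metric collection and storage |
| Grafana | Visualization dashboards |
| Alertmanager | Alert notifications |
| Web UI | Built-in monitoring interface |
Next, let's look at scaling the cluster out and in.
Last updated: 2026-03-27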