Monitoring and Alerting

1. Monitoring Overview

1.1 Monitoring Architecture

text

Monitoring architecture
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   CockroachDB cluster                                       │
│   ├── Node 1 (/_status/vars)                                │
│   ├── Node 2 (/_status/vars)                                │
│   └── Node 3 (/_status/vars)                                │
│         │                                                   │
│         ▼                                                   │
│   Prometheus                                                │
│   ├── Scrapes metrics                                       │
│   ├── Stores time-series data                               │
│   └── Evaluates alerting rules                              │
│         │                                                   │
│         ▼                                                   │
│   Grafana                                                   │
│   ├── Visualization dashboards                              │
│   └── Alert notifications                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
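
To make the scrape step concrete, here is a minimal Python sketch of what a scraper does with the text served at `/_status/vars`: it reads lines in the Prometheus text exposition format and turns them into name/labels/value samples. The function name `parse_metrics` and the sample payload are illustrative assumptions, and the parser only handles simple label values (no escaped quotes inside values).

```python
import re

# One sample line looks like: metric_name{label="value",...} 12.5
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text-format output into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical payload in the shape a node might return.
SAMPLE = """\
# HELP sql_conns Number of open SQL connections
# TYPE sql_conns gauge
sql_conns 12
capacity{store="1"} 1e+11
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

In practice Prometheus does this parsing itself; the sketch is only meant to show what kind of data flows from the nodes into the time-series store.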

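1.2 Monitoring Metrics

Metric category	Description
System metrics	CPU, memory, disk, network
Cluster metrics	Node status, range distribution
SQL metrics	Query latency, throughput
Storage metrics	Write throughput, compaction activity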

2. Prometheus Integration

2.1 Configuring Prometheus

yaml

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'cockroachdb'
    metrics_path: '/_status/vars'
    static_configs:
      - targets:
          - 'node1:8080'
          - 'node2:8080'
          - 'node3:8080'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 'alertmanager:9093'

rule_files:
  - 'cockroachdb_alerts.yml'
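
The `relabel_configs` entry in this file rewrites the `instance` label so that it holds only the host name. A rough Python sketch of that single rule follows. Prometheus anchors relabel regexes at both ends, which `re.fullmatch` imitates, and it skips the rule when the regex does not match. The helper name `relabel_instance` is hypothetical, and Prometheus uses RE2 rather than Python's `re`, so this only approximates the behavior.

```python
import re

# Same pattern as the `regex` field in prometheus.yml.
ADDRESS_RE = re.compile(r"([^:]+):\d+")


def relabel_instance(address):
    """Return the value the instance label would end up with for an address."""
    match = ADDRESS_RE.fullmatch(address)
    if match is None:
        # No match: the rule is skipped, so instance keeps the full address.
        return address
    return match.group(1)  # corresponds to replacement '${1}'


for target in ["node1:8080", "node2:8080", "node3"]:
    print(target, "->", relabel_instance(target))
```

Stripping the port keeps dashboard legends short, at the cost of making two processes on the same host indistinguishable by `instance`.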

2.2 Alerting Rules

yaml

# cockroachdb_alerts.yml
groups:
  - name: cockroachdb
    rules:
      - alert: CockroachDBNodeDown
        expr: up{job="cockroachdb"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CockroachDB node down"
          description: "Node {{ $labels.instance }} has been down for more than 1 minute."

      - alert: CockroachDBHighLatency
        # CockroachDB latency histograms are recorded in nanoseconds: 1e9 ns = 1 s.
        expr: histogram_quantile(0.99, rate(sql_service_latency_bucket[5m])) > 1e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High SQL latency"
          description: "99th percentile SQL latency is above 1 second."

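      - alert: CockroachDBLowReplicas
        # ranges_underreplicated counts ranges with fewer replicas than their target.
        expr: sum(ranges_underreplicated{job="cockroachdb"}) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low replica count"
          description: "One or more ranges have fewer replicas than the replication target."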

      - alert: CockroachDBHighDiskUsage
        # capacity and capacity_available are reported per store, in bytes.
        expr: (1 - capacity_available / capacity) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage"
          description: "Disk usage is above 80%."

      - alert: CockroachDBSlowQueries
        # More than 100 statements per second while p95 latency exceeds 500 ms (5e8 ns).
        expr: rate(sql_service_latency_count[5m]) > 100 and histogram_quantile(0.95, rate(sql_service_latency_bucket[5m])) > 5e8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries detected"
          description: "p95 SQL latency is above 500 ms under sustained load."
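
The latency rules in this file depend on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The Python sketch below mirrors that idea for a single series. The bucket values are made up for illustration, and the sketch omits edge cases that Prometheus handles, such as NaN inputs or non-monotonic buckets.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Expects at least one finite bucket plus a final +Inf bucket, as in
    Prometheus histograms.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative bucket counts, with bounds in seconds.
BUCKETS = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print("p50 estimate:", histogram_quantile(0.5, BUCKETS))
print("p99 estimate:", histogram_quantile(0.99, BUCKETS))
```

Because the result is interpolated, its accuracy depends on how finely the buckets are spaced around the threshold used in the alert.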

3. Grafana Dashboards

3.1 Importing a Dashboard

text

To import an official dashboard:
1. Open Grafana
2. Go to Dashboards > Import
3. Enter the dashboard ID: 10908 (CockroachDB Official)
4. Select the Prometheus data source
5. Click Import

Official dashboards:
- CockroachDB Overview: 10908
- CockroachDB Runtime: 10909
- CockroachDB SQL: 10910
- CockroachDB Storage: 10911

3.2 Custom Dashboard

json

{
  "dashboard": {
    "title": "CockroachDB Custom Dashboard",
    "panels": [
      {
        "title": "QPS",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(sql_query_count[1m])",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Latency (p99, seconds)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(sql_service_latency_bucket[5m])) / 1e9",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Replica Count",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(replicas)",
            "legendFormat": "Total"
          }
        ]
      },
      {
        "title": "Disk Usage",
        "type": "gauge",
        "targets": [
          {
            "expr": "(1 - capacity_available / capacity) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      }
    ]
  }
}
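
Dashboard JSON with this shape is tedious to write by hand, so it can be generated instead. The following Python sketch builds the same kind of payload from a list of (title, panel type, PromQL expression) tuples and serializes it. The helper name `build_dashboard` and the example panel are illustrative assumptions, and a real dashboard would normally also carry layout fields such as `gridPos`.

```python
import json


def build_dashboard(title, panels):
    """Build a dashboard payload from (panel_title, panel_type, expr) tuples."""
    return {
        "dashboard": {
            "title": title,
            "panels": [
                {
                    "title": panel_title,
                    "type": panel_type,
                    "targets": [
                        {"expr": expr, "legendFormat": "{{instance}}"}
                    ],
                }
                for panel_title, panel_type, expr in panels
            ],
        }
    }


# Example: a single panel showing open SQL connections per node.
payload = build_dashboard(
    "CockroachDB Custom Dashboard",
    [("Open SQL connections", "stat", "sql_conns")],
)
print(json.dumps(payload, indent=2))
```

Generating panels from a list keeps titles, types and queries in one place, which makes it easier to review a change to a query than diffing hand-edited JSON.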

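4. Key Metrics

4.1 SQL Metrics

yaml

# Key SQL metrics (names as exposed at /_status/vars)

# Query throughput (QPS): take rate() of this counter
sql_query_count

# Query latency (histogram, in nanoseconds)
sql_service_latency_bucket
sql_service_latency_sum
sql_service_latency_count

# Open SQL connections
sql_conns

# Transaction statistics
txn_commits
txn_aborts
txn_restarts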
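
Several of the SQL metrics listed here are monotonically increasing counters, so dashboards and alerts look at their per-second rate rather than the raw value. As a rough illustration, this Python sketch computes an average rate from (timestamp, value) samples and treats any decrease as a counter reset, for example after a node restart. It is simplified compared with PromQL's `rate()`, which also extrapolates to the window boundaries. The function name `counter_rate` and the sample data are hypothetical.

```python
def counter_rate(samples):
    """Average per-second increase over (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        return None  # a rate needs at least two points
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count curr as new.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical samples taken every 15 seconds.
steady = [(0, 100), (15, 250), (30, 400)]
with_reset = [(0, 100), (15, 250), (30, 20), (45, 170)]
print("steady:", counter_rate(steady))
print("with reset:", counter_rate(with_reset))
```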

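4.2 Storage Metrics

yaml

# Key storage metrics

# Store capacity (bytes)
capacity
capacity_available
capacity_used

# Disk writes (host-level counter, bytes)
sys_host_disk_write_bytes

# Disk reads (host-level counter, bytes)
sys_host_disk_read_bytes

# Compaction activity
rocksdb_compactions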

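4.3 Cluster Metrics

yaml

# Key cluster metrics

# Node reachability as seen by Prometheus (1 = last scrape succeeded)
up

# Number of ranges
ranges

# Number of Raft leaders
replicas_leaders

# Replica state
replicas
replicas_quiescent
replicas_leaseholders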

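5. Web UI Monitoring

5.1 Built-in Monitoring

text

Open the Web UI:
http://localhost:8080

Main sections:
├── Overview: cluster overview
├── Metrics: detailed metrics
├── Databases: database management
├── SQL Activity: SQL activity
├── Jobs: background jobs
├── Advanced Debug: advanced debugging
└── Settings: settings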

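5.2 SQL Activity Monitoring

sql

-- Most frequently executed statements on this node
-- (latency columns in this table are averages in seconds)
SELECT
    key AS query,
    count,
    service_lat_avg,
    run_lat_avg
FROM crdb_internal.node_statement_statistics
ORDER BY count DESC
LIMIT 10;

-- Active transactions across the cluster
SELECT * FROM crdb_internal.cluster_transactions;

-- Slow statements: average service latency above 1 second
SELECT
    key AS query,
    service_lat_avg * 1000 AS latency_ms
FROM crdb_internal.node_statement_statistics
WHERE service_lat_avg > 1
ORDER BY service_lat_avg DESC;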

6. Alert Configuration

6.1 Alertmanager Configuration

yaml

# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'team-email'
  routes:
    - match:
        severity: critical
      receiver: 'team-pagerduty'
    - match:
        severity: warning
      receiver: 'team-email'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true

  - name: 'team-pagerduty'
    pagerduty_configs:
      - service_key: 'your-service-key'
        severity: critical
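
The routing tree in this file is evaluated top-down: an alert goes to the first child route whose `match` labels all equal the alert's labels, and falls back to the top-level receiver otherwise. The sketch below models just that lookup in Python. `pick_receiver` is a hypothetical helper, and it ignores features such as `continue`, nested routes, regex matchers and grouping.

```python
# Child routes in file order: (labels that must match, receiver name).
ROUTES = [
    ({"severity": "critical"}, "team-pagerduty"),
    ({"severity": "warning"}, "team-email"),
]
DEFAULT_RECEIVER = "team-email"  # the top-level `receiver`


def pick_receiver(alert_labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receiver for an alert: first matching route, else default."""
    for match, receiver in routes:
        if all(alert_labels.get(key) == value for key, value in match.items()):
            return receiver
    return default


print(pick_receiver({"alertname": "CockroachDBNodeDown", "severity": "critical"}))
print(pick_receiver({"alertname": "CockroachDBHighDiskUsage", "severity": "warning"}))
print(pick_receiver({"alertname": "Unlabelled"}))
```

With this tree, only alerts labelled `severity: critical` page someone; everything else, including alerts with no severity label, ends up in email.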

6.2 Example Alerting Rules

yaml

groups:
  - name: cockroachdb_critical
    rules:
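      - alert: CockroachDBClusterDown
        # sum() still yields 0 when every node is down, so the alert fires;
        # count() over a filtered vector would return no data in that case.
        expr: sum(up{job="cockroachdb"}) < 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CockroachDB cluster critical"
          description: "Fewer than 2 nodes are up. Cluster may be unavailable."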

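      - alert: CockroachDBReplicaUnavailability
        # ranges_underreplicated counts ranges with fewer replicas than their target.
        expr: sum(ranges_underreplicated{job="cockroachdb"}) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Replica unavailability"
          description: "Some ranges have fewer replicas than their replication target."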

  - name: cockroachdb_warning
    rules:
      - alert: CockroachDBHighMemoryUsage
        # sys_rss is the resident set size of the cockroach process, in bytes.
        expr: sys_rss / (1024 * 1024 * 1024) > 16
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Resident memory usage is above 16 GiB."

      - alert: CockroachDBHeartbeatFailures
        expr: rate(liveness_heartbeatfailures[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node liveness heartbeat failures"
          description: "Liveness heartbeat failures detected."
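
Each rule's `for` clause means the expression must stay true for that long before the alert fires; until then the alert is only pending, and a single false evaluation resets the timer. A minimal Python model of that state machine is sketched below. The class name `AlertState` and the evaluation timestamps are illustrative, and real Prometheus tracks this per label set rather than per rule.

```python
class AlertState:
    """Tracks one alert through inactive -> pending -> firing."""

    def __init__(self, for_seconds):
        self.for_seconds = for_seconds
        self.active_since = None  # time the condition first became true

    def evaluate(self, now, condition_true):
        """Record one rule evaluation at time `now` and return the state."""
        if not condition_true:
            self.active_since = None  # any false evaluation resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# Simulate a rule with `for: 1m`, evaluated every 15 seconds.
state = AlertState(for_seconds=60)
timeline = [(0, True), (15, True), (30, True), (45, True), (60, True), (75, False)]
for now, condition in timeline:
    print(now, state.evaluate(now, condition))
```

A longer `for` value suppresses alerts on brief blips, but also delays the page for a real outage by the same amount.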

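7. Log Management

7.1 Log Configuration

text

# Logging flags for cockroach start
# (v21.1 and later deprecate these in favor of the YAML-based --log / --log-config-file options)
--log-dir=/var/log/cockroach
--log-file-max-size=100MiB
--log-group-max-size=1GiB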

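7.2 Log Analysis

bash

# Collect logs and diagnostics from the cluster into a zip archive.
# This connects to the SQL/RPC port (26257 by default), not the HTTP port;
# add --insecure or --certs-dir to match how the cluster is secured.
cockroach debug zip debug.zip --host=localhost:26257

# Search for slow-query messages
grep -i "slow" /var/log/cockroach/cockroach.log

# List error entries (in the default log format, error lines start with "E")
grep "^E" /var/log/cockroach/cockroach.log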
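
For anything beyond a quick `grep`, a small script can tally log entries by severity. The Python sketch below assumes the default `crdb-v1` text format, in which each entry starts with a severity letter (I, W, E or F) followed by a six-digit date. The sample lines are invented for illustration, and the function name `count_severities` is hypothetical.

```python
import re
from collections import Counter

# Entry header: severity letter, then yymmdd, then a space.
ENTRY_RE = re.compile(r"^([IWEF])\d{6} ")
SEVERITY_NAMES = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def count_severities(lines):
    """Count log entries per severity; lines without a header are ignored."""
    counts = Counter()
    for line in lines:
        match = ENTRY_RE.match(line)
        if match:
            counts[SEVERITY_NAMES[match.group(1)]] += 1
    return counts


# Invented sample entries in the assumed format.
SAMPLE_LINES = [
    "I240101 10:00:00.000001 1 server/node.go:1 [n1] node started",
    "W240101 10:00:01.000002 2 kv/kvserver/store.go:2 [n1] slow heartbeat",
    "E240101 10:00:02.000003 3 sql/conn.go:3 [n1] connection reset",
    "    continuation line without a header",
]
print(dict(count_severities(SAMPLE_LINES)))
```

To run it against a real file, pass an open file object instead of the sample list, since iterating a file yields its lines.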

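8. Summary

Key points for monitoring and alerting:

Component	Role
Prometheus	Metric collection and storage
Grafana	Visualization dashboards
Alertmanager	Alert notifications
Web UI	Built-in monitoring interface

Next, let's look at scaling the cluster out and in!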