Amazon DocumentDB Monitoring and Alerting

1. Monitoring Overview

1.1 What to Monitor

text

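Monitoring dimensions:
├── Performance metrics
│   ├── CPU utilization
│   ├── Memory usage
│   ├── Connection count
│   └── Query performance
│
├── Storage metrics
│   ├── Storage used
│   ├── IOPS
│   └── Latency
│
├── Network metrics
│   ├── Network throughput
│   └── Connection count
│
└── Replication metrics
    ├── Replication lag
    └── Replica status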

1.2 Monitoring Tools

text

Monitoring tools:
├── Amazon CloudWatch
├── AWS CloudTrail
├── DocumentDB events
├── Performance Insights
└── Custom monitoring scripts

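2. CloudWatch Metrics

2.1 Instance-Level Metrics

Metric	Description
CPUUtilization	CPU utilization (percent)
FreeableMemory	Available memory (bytes)
DatabaseConnections	Number of open database connections
ReadIOPS	Read I/O operations per second
WriteIOPS	Write I/O operations per second
ReadLatency	Read latency
WriteLatency	Write latency
ReadThroughput	Read throughput
WriteThroughput	Write throughput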

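2.2 Cluster-Level Metrics

Metric	Description
VolumeBytesUsed	Storage used by the cluster volume (bytes)
VolumeReadIOPs	Read I/O operations on the cluster volume
VolumeWriteIOPs	Write I/O operations on the cluster volume
DMLThroughput	DML throughput
DDLThroughput	DDL throughput
BufferCacheHitRatio	Buffer cache hit ratio (percent)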

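2.3 Replication Metrics

Metric	Description
DBInstanceReplicaLag	Lag of a replica instance behind the primary (milliseconds)
DBClusterReplicaLagMaximum	Largest replica lag across the cluster (milliseconds)
DBClusterReplicaLagMinimum	Smallest replica lag across the cluster (milliseconds)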

2.4 Viewing Metrics

bash

# Retrieve CPU utilization as 5-minute averages
aws cloudwatch get-metric-statistics \
  --namespace AWS/DocDB \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=my-primary \
  --statistics Average \
  --period 300 \
  --start-time 2024-01-15T00:00:00Z \
  --end-time 2024-01-15T23:59:59Z
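
The fixed timestamps in the command above have to be edited for every run. As a small sketch, the window can be computed instead. `metricWindow` is a hypothetical helper name used only for illustration; it is not part of the AWS CLI or SDK.

```javascript
// Build an ISO 8601 time window that ends at `now` and spans `hours` hours.
// Hypothetical helper: the name and return shape are assumptions.
function metricWindow(now, hours) {
  const end = new Date(now.getTime());
  const start = new Date(end.getTime() - hours * 60 * 60 * 1000);
  // toISOString() includes milliseconds (e.g. 2024-01-15T00:00:00.000Z);
  // strip them so the value matches the format used in the CLI example.
  const fmt = (d) => d.toISOString().replace(/\.\d{3}Z$/, 'Z');
  return { startTime: fmt(start), endTime: fmt(end) };
}

// Example: the 24 hours before midnight UTC on 2024-01-16.
const lastDay = metricWindow(new Date('2024-01-16T00:00:00Z'), 24);
console.log(`--start-time ${lastDay.startTime} --end-time ${lastDay.endTime}`);
```

In practice you would pass `new Date()` as the first argument and feed the two values to `--start-time` and `--end-time`.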

3. CloudWatch Alarms

3.1 Creating a CPU Alarm

bash

# Alarm on high CPU utilization
aws cloudwatch put-metric-alarm \
  --alarm-name docdb-high-cpu \
  --alarm-description "CPU utilization above 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/DocDB \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=DBInstanceIdentifier,Value=my-primary \
  --alarm-actions arn:aws:sns:us-east-1:123:my-alerts
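
With `--period 300` and `--evaluation-periods 2`, the alarm needs two consecutive 5-minute averages above 80 before it fires, so a single spike is ignored. The sketch below models that rule in simplified form. `alarmState` is an illustrative function, not a CloudWatch API, and real alarms additionally handle missing data and "M out of N" datapoint settings.

```javascript
// Simplified model of a GreaterThanThreshold alarm: it is in ALARM when the
// most recent `evaluationPeriods` datapoints are all above `threshold`.
// Missing-data handling and "M out of N" evaluation are deliberately ignored.
function alarmState(datapoints, threshold, evaluationPeriods) {
  if (datapoints.length < evaluationPeriods) return 'INSUFFICIENT_DATA';
  const recent = datapoints.slice(-evaluationPeriods);
  return recent.every((value) => value > threshold) ? 'ALARM' : 'OK';
}

// Five-minute CPU averages in percent; threshold 80, two evaluation periods.
console.log(alarmState([55, 85, 78], 80, 2)); // one breach followed by recovery
console.log(alarmState([55, 85, 91], 80, 2)); // two consecutive breaches
```

The first call reports `OK` and the second `ALARM`, which is the behavior that keeps short bursts from paging anyone.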

3.2 Creating a Memory Alarm

bash

# Alarm on low freeable memory
aws cloudwatch put-metric-alarm \
  --alarm-name docdb-low-memory \
  --alarm-description "Freeable memory below 1 GiB" \
  --metric-name FreeableMemory \
  --namespace AWS/DocDB \
  --statistic Average \
  --period 300 \
  --threshold 1073741824 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=DBInstanceIdentifier,Value=my-primary \
  --alarm-actions arn:aws:sns:us-east-1:123:my-alerts
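
`FreeableMemory` is reported in bytes, which is why the threshold above is 1073741824 rather than 1. A minimal conversion sketch follows; the helper names are illustrative.

```javascript
// 1 GiB = 1024^3 bytes.
const BYTES_PER_GIB = 1024 ** 3;

// Convert a size in GiB to the byte value expected by --threshold.
function gibToBytes(gib) {
  return Math.round(gib * BYTES_PER_GIB);
}

// Convert a byte value from CloudWatch back to GiB for display.
function bytesToGib(bytes) {
  return bytes / BYTES_PER_GIB;
}

console.log(gibToBytes(1));          // 1073741824
console.log(gibToBytes(0.5));        // 536870912
console.log(bytesToGib(2147483648)); // 2
```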

3.3 Creating a Connection Count Alarm

bash

# Alarm on a high connection count
aws cloudwatch put-metric-alarm \
  --alarm-name docdb-high-connections \
  --alarm-description "Connection count above threshold" \
  --metric-name DatabaseConnections \
  --namespace AWS/DocDB \
  --statistic Average \
  --period 300 \
  --threshold 1000 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=DBInstanceIdentifier,Value=my-primary \
  --alarm-actions arn:aws:sns:us-east-1:123:my-alerts

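3.4 Creating a Replication Lag Alarm

bash

# Alarm on replication lag (DBClusterReplicaLagMaximum is reported in milliseconds)
aws cloudwatch put-metric-alarm \
  --alarm-name docdb-replication-lag \
  --alarm-description "Replication lag above 10 seconds" \
  --metric-name DBClusterReplicaLagMaximum \
  --namespace AWS/DocDB \
  --statistic Average \
  --period 60 \
  --threshold 10000 \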
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --dimensions Name=DBClusterIdentifier,Value=my-cluster \
  --alarm-actions arn:aws:sns:us-east-1:123:my-alerts

4. CloudWatch Dashboards

4.1 Creating a Dashboard

bash

# Create a monitoring dashboard
aws cloudwatch put-dashboard \
  --dashboard-name DocumentDB-Monitoring \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "x": 0,
        "y": 0,
        "width": 12,
        "height": 6,
        "properties": {
          "metrics": [
            ["AWS/DocDB", "CPUUtilization", "DBInstanceIdentifier", "my-primary"],
            [".", "FreeableMemory", ".", "."]
          ],
          "period": 300,
          "stat": "Average",
          "region": "us-east-1",
          "title": "Instance performance"
        }
      }
    ]
  }'
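
Hand-writing the dashboard JSON becomes tedious once more metrics are added. The sketch below generates a body with one widget per metric, so that metrics with very different units (percent and bytes, for example) do not share a y-axis. `buildDashboardBody` and its layout choices are assumptions made for illustration; the result is a string suitable for `--dashboard-body`.

```javascript
// Generate a dashboard body with one metric widget per metric name.
// The dashboard grid is 24 units wide, so two 12-unit widgets fit per row.
function buildDashboardBody(instanceId, region, metricNames) {
  const widgets = metricNames.map((name, index) => ({
    type: 'metric',
    x: (index % 2) * 12,          // left or right half of the row
    y: Math.floor(index / 2) * 6, // each row is 6 units tall
    width: 12,
    height: 6,
    properties: {
      metrics: [['AWS/DocDB', name, 'DBInstanceIdentifier', instanceId]],
      period: 300,
      stat: 'Average',
      region,
      title: name,
    },
  }));
  return JSON.stringify({ widgets });
}

const dashboardBody = buildDashboardBody('my-primary', 'us-east-1', [
  'CPUUtilization',
  'FreeableMemory',
  'DatabaseConnections',
]);
console.log(dashboardBody);
```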

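4.2 Recommended Dashboard Panels

text

Recommended panels:
├── CPU and memory usage
├── Database connection count
├── Read/write IOPS and latency
├── Storage used
├── Replication lag
├── Cache hit ratio
└── Network throughput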

5. Event Monitoring

5.1 Viewing Events

bash

# List cluster events from the last 60 minutes
aws docdb describe-events \
  --source-type db-cluster \
  --source-identifier my-cluster \
  --duration 60

# List instance events from the last 60 minutes
aws docdb describe-events \
  --source-type db-instance \
  --source-identifier my-primary \
  --duration 60

5.2 Event Subscriptions

bash

# Create an event subscription
aws docdb create-event-subscription \
  --subscription-name docdb-events \
  --sns-topic-arn arn:aws:sns:us-east-1:123:my-alerts \
  --source-type db-cluster \
  --event-categories '["creation","deletion","failover","failure"]'

# Available event categories
# creation, deletion, modification, failover, failure, maintenance, notification

5.3 Viewing and Deleting Subscriptions

bash

# List event subscriptions
aws docdb describe-event-subscriptions

# Delete an event subscription
aws docdb delete-event-subscription \
  --subscription-name docdb-events

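6. Log Management

6.1 Enabling Log Exports

bash

# Enable log exports to CloudWatch Logs on an existing cluster.
# Auditing and profiling must also be enabled in the cluster parameter group
# (the audit_logs and profiler parameters).
aws docdb modify-db-cluster \
  --db-cluster-identifier my-cluster \
  --cloudwatch-logs-export-configuration '{"EnableLogTypes":["audit","profiler"]}' \
  --apply-immediately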

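6.2 Log Types

text

Log types:
├── audit - audit log
│   ├── Authentication events
│   ├── Authorization events
│   └── Data operations
│
└── profiler - profiler log
    ├── Slow queries
    ├── Execution plans
    └── Performance statistics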

6.3 Viewing Logs

bash

# Tail the audit log in CloudWatch Logs
aws logs tail /aws/docdb/my-cluster/audit \
  --since 1h

# Search the log
aws logs filter-log-events \
  --log-group-name /aws/docdb/my-cluster/audit \
  --filter-pattern "ERROR"

6.4 Log Analysis

text

# CloudWatch Logs Insights queries
# Find slow queries
fields @timestamp, @message
| filter @message like /slow/
| sort @timestamp desc
| limit 100

# Count log entries by operation
fields @message
| parse @message "operation: *," as operation
| stats count() as entries by operation
| sort entries desc
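
The same kind of slow-operation analysis can be done on log lines that have already been downloaded. The sketch below assumes that each profiler entry is a JSON object with `op`, `ns`, and `millis` fields; that layout is an assumption to verify against your own log output, and `slowOperations` is a hypothetical helper.

```javascript
// Return operations at or above `minMillis`, slowest first.
// Assumes one JSON object per line with `op`, `ns`, and `millis` fields.
function slowOperations(lines, minMillis) {
  const matches = [];
  for (const line of lines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch (err) {
      continue; // skip lines that are not valid JSON
    }
    if (entry !== null && typeof entry === 'object' &&
        typeof entry.millis === 'number' && entry.millis >= minMillis) {
      matches.push({ op: entry.op, ns: entry.ns, millis: entry.millis });
    }
  }
  return matches.sort((a, b) => b.millis - a.millis);
}

// Hand-written sample lines for illustration only.
const sampleLines = [
  '{"op":"query","ns":"shop.orders","millis":120}',
  '{"op":"update","ns":"shop.carts","millis":15}',
  'not a json line',
  '{"op":"query","ns":"shop.users","millis":450}',
];
console.log(slowOperations(sampleLines, 100));
```

With the sample input, the two `query` entries are returned, with the 450 ms operation first.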

7. Performance Insights

7.1 Enabling Performance Insights

bash

# Enable Performance Insights on an instance
aws docdb modify-db-instance \
  --db-instance-identifier my-primary \
  --enable-performance-insights \
  --apply-immediately

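7.2 Viewing Performance Data

bash

# Fetch database load. --identifier takes the instance's DbiResourceId
# (returned by describe-db-instances), not its name; the value below is a placeholder.
aws pi get-resource-metrics \
  --service-type DOCDB \
  --identifier db-EXAMPLERESOURCEID \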
  --metric-queries '[{"Metric":"db.load.avg"}]' \
  --start-time 2024-01-15T00:00:00Z \
  --end-time 2024-01-15T23:59:59Z \
  --period 300

7.3 Analysis Dimensions

text

Analysis dimensions:
├── Wait events
├── Queries
├── Host resources
├── User activity
└── Database state

8. Custom Monitoring

8.1 Custom Metrics

javascript

// Publish a custom metric to CloudWatch (AWS SDK for JavaScript v2)
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

async function putCustomMetric(metricName, value, unit = 'Count') {
  await cloudwatch.putMetricData({
    Namespace: 'Custom/DocumentDB',
    MetricData: [{
      MetricName: metricName,
      Value: value,
      Unit: unit,
      Dimensions: [{
        Name: 'ClusterName',
        Value: 'my-cluster'
      }]
    }]
  }).promise();
}

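// Example: report the number of slow queries.
// getSlowQueryCount() is a placeholder for your own logic.
async function reportSlowQueries() {
  const slowQueryCount = await getSlowQueryCount();
  await putCustomMetric('SlowQueryCount', slowQueryCount);
}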

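8.2 Health Check Script

javascript

// Health check script; reuses putCustomMetric() from section 8.1
const { MongoClient } = require('mongodb');

// Placeholder: supply your own cluster connection string
const uri = process.env.DOCDB_URI;

async function healthCheck() {
  const client = new MongoClient(uri);

  try {
    await client.connect();

    // Measure round-trip latency with a ping
    const start = Date.now();
    await client.db('admin').command({ ping: 1 });
    const latency = Date.now() - start;

    // Check replica status
    const status = await client.db('admin').command({ replSetGetStatus: 1 });

    // Publish metrics
    await putCustomMetric('ConnectionLatency', latency, 'Milliseconds');
    await putCustomMetric('ReplicaCount', status.members.length);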
    
    return {
      healthy: true,
      latency,
      replicas: status.members.length
    };
    
  } catch (error) {
    await putCustomMetric('HealthCheckFailed', 1);
    return {
      healthy: false,
      error: error.message
    };
  } finally {
    await client.close();
  }
}

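9. Alarm Response

9.1 Alarm Handling Workflow

text

Alarm handling workflow:
├── 1. Receive the alarm notification
├── 2. Confirm the alarm severity
├── 3. Analyze the cause
├── 4. Take corrective action
├── 5. Verify the result
└── 6. Document the incident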

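9.2 Handling Common Alarms

text

CPU alarm:
├── Check the query load
├── Optimize slow queries
├── Add indexes
├── Scale up the instance class
└── Add read replicas

Memory alarm:
├── Check the connection count
├── Optimize queries
├── Reduce the working set
├── Scale up the instance class
└── Check for memory leaks

Connection count alarm:
├── Check the connection pool configuration
├── Check for connection leaks
├── Improve connection management
└── Scale up the instance class

Latency alarm:
├── Check query performance
├── Optimize indexes
├── Check the network
└── Check resource usage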
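
The checklists above can be attached to incoming notifications automatically. CloudWatch alarm notifications delivered through SNS carry a JSON message with fields such as `AlarmName`, `NewStateValue`, and `Trigger.MetricName`. The sketch below maps such a message to the matching checklist; `triageAlarm`, `categoryFor`, and the `RUNBOOKS` table are illustrative names, and the sample message is abbreviated and hand-written.

```javascript
// First-response steps per alarm category, taken from the checklists above.
const RUNBOOKS = {
  cpu: ['Check the query load', 'Optimize slow queries', 'Add indexes',
    'Scale up the instance class', 'Add read replicas'],
  memory: ['Check the connection count', 'Optimize queries', 'Reduce the working set',
    'Scale up the instance class', 'Check for memory leaks'],
  connections: ['Check the connection pool configuration', 'Check for connection leaks',
    'Improve connection management', 'Scale up the instance class'],
  latency: ['Check query performance', 'Optimize indexes', 'Check the network',
    'Check resource usage'],
};

// Map a CloudWatch metric name to one of the categories above.
function categoryFor(metricName) {
  if (metricName === 'CPUUtilization') return 'cpu';
  if (metricName === 'FreeableMemory') return 'memory';
  if (metricName === 'DatabaseConnections') return 'connections';
  if (/Latency|Lag/.test(metricName)) return 'latency';
  return 'unknown';
}

// Parse an SNS alarm message and return the steps to start with.
function triageAlarm(snsMessage) {
  const alarm = JSON.parse(snsMessage);
  const metric = alarm.Trigger ? alarm.Trigger.MetricName : undefined;
  const category = categoryFor(metric);
  const firing = alarm.NewStateValue === 'ALARM';
  return {
    alarmName: alarm.AlarmName,
    state: alarm.NewStateValue,
    category,
    steps: firing ? (RUNBOOKS[category] || ['Escalate for manual investigation']) : [],
  };
}

// Abbreviated sample message for illustration.
const sampleMessage = JSON.stringify({
  AlarmName: 'docdb-high-cpu',
  NewStateValue: 'ALARM',
  Trigger: { MetricName: 'CPUUtilization', Namespace: 'AWS/DocDB' },
});
console.log(triageAlarm(sampleMessage));
```

Keeping the steps in a plain lookup table means the on-call checklist and the automation stay in one place.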

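10. Monitoring Best Practices

10.1 Monitoring Strategy

text

Monitoring strategy:
├── Set alarms on key metrics
├── Build monitoring dashboards
├── Review alarm thresholds regularly
├── Record baseline performance
└── Keep refining the monitoring setup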
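
One way to turn a recorded baseline into an alarm threshold is to take the mean of recent samples and add a multiple of their standard deviation. This is a common heuristic rather than anything DocumentDB prescribes; `baselineThreshold` and the multiplier `k` are assumptions for illustration.

```javascript
// Derive a threshold as mean + k * (population standard deviation).
function baselineThreshold(samples, k) {
  if (samples.length === 0) {
    throw new Error('at least one sample is required');
  }
  const mean = samples.reduce((sum, v) => sum + v, 0) / samples.length;
  const variance =
    samples.reduce((sum, v) => sum + (v - mean) ** 2, 0) / samples.length;
  const stddev = Math.sqrt(variance);
  return { mean, stddev, threshold: mean + k * stddev };
}

// Hypothetical CPU utilization samples (percent) from a quiet week.
const cpuBaseline = baselineThreshold([42, 44, 44, 44, 45, 45, 47, 49], 3);
console.log(cpuBaseline); // { mean: 45, stddev: 2, threshold: 51 }
```

A larger `k` gives fewer false alarms at the cost of later detection, so the value is worth revisiting whenever thresholds are reviewed.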

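10.2 Alerting Strategy

text

Alerting strategy:
├── Set reasonable thresholds
├── Avoid alert fatigue
├── Handle alarms by severity tier
├── Establish an on-call rotation
└── Rehearse the response regularly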
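
Tiered handling can be sketched as a pair of thresholds per metric, with a flag for metrics where lower values are worse (such as `FreeableMemory`). The function name, tier names, and numbers below are placeholders, not recommended values.

```javascript
// Classify a metric value as 'ok', 'warning', or 'critical'.
// `tier.lowerIsWorse` flips the comparison for metrics like free memory.
function severity(value, tier) {
  const breached = (limit) =>
    tier.lowerIsWorse ? value <= limit : value >= limit;
  if (breached(tier.critical)) return 'critical';
  if (breached(tier.warning)) return 'warning';
  return 'ok';
}

// Example tiers; tune the numbers for your own workload.
const TIERS = {
  CPUUtilization: { warning: 70, critical: 90, lowerIsWorse: false },
  FreeableMemory: {
    warning: 2 * 1024 ** 3, // 2 GiB
    critical: 1024 ** 3,    // 1 GiB
    lowerIsWorse: true,
  },
};

console.log(severity(75, TIERS.CPUUtilization));              // warning
console.log(severity(512 * 1024 ** 2, TIERS.FreeableMemory)); // critical
```

Routing only the `critical` tier to the on-call pager and the `warning` tier to a ticket queue is one way to limit alert fatigue.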

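10.3 Logging Strategy

text

Logging strategy:
├── Enable the logs you need
├── Set a log retention period
├── Establish a log analysis process
├── Review logs regularly
└── Protect log data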

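11. Summary

11.1 Key Metrics

Category	Metrics
Performance	CPU, memory, connection count
Storage	IOPS, latency, storage used
Replication	Replication lag
Network	Throughput, connection count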

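11.2 Best Practices Summary

text

Monitoring best practices:
├── Build a complete monitoring setup
├── Set reasonable alarm thresholds
├── Review and tune regularly
├── Establish a response process
└── Keep improving

Next, we will look at security configuration.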