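Monitoring and Alerting

1. Monitoring Overview

1.1 Monitoring Layers

```text
Monitoring layers:
├── Infrastructure monitoring
│   ├── CPU utilization
│   ├── Memory utilization
│   └── Network traffic
├── Database monitoring
│   ├── Query performance
│   ├── Connection count
│   └── Storage usage
├── Application monitoring
│   ├── Request latency
│   ├── Error rate
│   └── Throughput
└── Business monitoring
    ├── User activity
    └── Business metrics
```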

1.2 Monitoring Tools

```text
Monitoring tools:
├── CloudWatch Metrics
├── CloudWatch Logs
├── CloudWatch Alarms
├── CloudWatch Dashboards
└── AWS X-Ray
```

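2. CloudWatch Metrics

2.1 Key Metrics

```text
Key metrics:
├── CPUUtilization: CPU utilization (percent)
├── FreeableMemory: available memory (bytes)
├── VolumeReadIOPs: storage read IOPS
├── VolumeWriteIOPs: storage write IOPS
├── GremlinRequestsPerSec: Gremlin requests per second
├── GremlinHttp100s: count of Gremlin responses with an HTTP 1xx status
├── GremlinHttp200s: count of Gremlin responses with an HTTP 2xx status
├── GremlinHttp400s: count of Gremlin responses with an HTTP 4xx status (client errors)
├── GremlinHttp500s: count of Gremlin responses with an HTTP 5xx status (server errors)
├── GremlinErrors: number of Gremlin errors
├── SparqlRequestsPerSec: SPARQL requests per second
├── SparqlErrors: number of SPARQL errors
└── TotalRequestsPerSec: total requests per second
```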

2.2 Viewing Metrics

```bash
# View average CPU utilization over the last hour (date -d requires GNU date)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Neptune \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=my-neptune-primary \
  --start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average

# View the per-minute GremlinHttp200s count (a response count, not a latency measurement)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Neptune \
  --metric-name GremlinHttp200s \
  --dimensions Name=DBInstanceIdentifier,Value=my-neptune-primary \
  --start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Sum
```
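
The CLI calls above return a JSON document whose `Datapoints` list is not guaranteed to be in time order. As a minimal sketch, assuming that response shape (dicts carrying a `Timestamp` and the requested statistic), the hypothetical helper `summarize_datapoints` below sorts the points and reports the minimum, maximum, and mean. The sample values are made up and the code does not call AWS.

```python
from datetime import datetime, timezone
from statistics import mean


def summarize_datapoints(datapoints, stat="Average"):
    """Sort CloudWatch datapoints by time and summarize one statistic.

    `datapoints` is assumed to look like the `Datapoints` list returned by
    get-metric-statistics: dicts with a `Timestamp` and the requested stat.
    """
    if not datapoints:
        return None
    ordered = sorted(datapoints, key=lambda d: d["Timestamp"])
    values = [d[stat] for d in ordered]
    return {
        "first": ordered[0]["Timestamp"],
        "last": ordered[-1]["Timestamp"],
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


if __name__ == "__main__":
    # Hypothetical sample data, not real Neptune output.
    sample = [
        {"Timestamp": datetime(2024, 1, 1, 0, 2, tzinfo=timezone.utc), "Average": 40.0},
        {"Timestamp": datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc), "Average": 20.0},
        {"Timestamp": datetime(2024, 1, 1, 0, 1, tzinfo=timezone.utc), "Average": 60.0},
    ]
    print(summarize_datapoints(sample))
```

With boto3, the same list is available as `response["Datapoints"]` from `get_metric_statistics`, so the helper can be applied to it directly.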

2.3 Custom Metrics

```python
# Publish a custom metric to CloudWatch
import boto3

cloudwatch = boto3.client('cloudwatch')

# Record one query-latency sample (150 ms) for Gremlin queries
cloudwatch.put_metric_data(
    Namespace='MyApplication/Neptune',
    MetricData=[
        {
            'MetricName': 'QueryLatency',
            'Value': 150,
            'Unit': 'Milliseconds',
            'Dimensions': [
                {
                    'Name': 'QueryType',
                    'Value': 'Gremlin'
                }
            ]
        }
    ]
)
```
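
The example above publishes a fixed value of 150 ms. In practice the value would come from timing a real query. The sketch below is one possible way to do that: the helper names `build_latency_metric`, `timed`, and `publish` are hypothetical, and `time.sleep` stands in for a Neptune query. boto3 is imported lazily so the timing helpers can be tried without AWS access.

```python
import time
from contextlib import contextmanager


def build_latency_metric(latency_ms, query_type="Gremlin"):
    """Build one MetricData entry in the shape used by put_metric_data."""
    return {
        "MetricName": "QueryLatency",
        "Value": float(latency_ms),
        "Unit": "Milliseconds",
        "Dimensions": [{"Name": "QueryType", "Value": query_type}],
    }


@contextmanager
def timed(results, query_type="Gremlin"):
    """Time the enclosed block and append a metric entry to `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        results.append(build_latency_metric(elapsed_ms, query_type))


def publish(metric_data, namespace="MyApplication/Neptune"):
    """Send collected entries to CloudWatch (requires boto3 and credentials)."""
    import boto3  # lazy import: the helpers above work without AWS access

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=metric_data
    )


if __name__ == "__main__":
    collected = []
    with timed(collected):
        time.sleep(0.01)  # stand-in for a real Neptune query
    print(collected)
```

Collecting entries in a list and publishing them in one call keeps the number of CloudWatch API requests low when many queries are timed.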

3. Log Management

3.1 Enabling Logs

```bash
# Export audit logs to CloudWatch Logs
# (the cluster parameter neptune_enable_audit_log must also be set to 1)
aws neptune modify-db-cluster \
  --db-cluster-identifier my-neptune-cluster \
  --cloudwatch-logs-export-configuration '{"EnableLogTypes":["audit"]}'

# Export slow-query logs to CloudWatch Logs
aws neptune modify-db-cluster \
  --db-cluster-identifier my-neptune-cluster \
  --cloudwatch-logs-export-configuration '{"EnableLogTypes":["slowquery"]}'
```
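
The two CLI calls above can also be issued as a single request. The following is a minimal boto3 sketch under that assumption; `log_export_request` and `enable_log_exports` are hypothetical helper names, and the allow-list is deliberately limited to the two log types enabled in this section.

```python
def log_export_request(cluster_id, log_types=("audit", "slowquery")):
    """Build keyword arguments for the Neptune modify_db_cluster call."""
    allowed = {"audit", "slowquery"}  # only the types used in this section
    unknown = set(log_types) - allowed
    if unknown:
        raise ValueError(f"unsupported log types: {sorted(unknown)}")
    return {
        "DBClusterIdentifier": cluster_id,
        "CloudwatchLogsExportConfiguration": {"EnableLogTypes": list(log_types)},
    }


def enable_log_exports(cluster_id, log_types=("audit", "slowquery")):
    """Apply the request (requires boto3 and AWS credentials)."""
    import boto3  # lazy import so the builder can be tested offline

    return boto3.client("neptune").modify_db_cluster(
        **log_export_request(cluster_id, log_types)
    )


if __name__ == "__main__":
    print(log_export_request("my-neptune-cluster"))
```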

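3.2 Log Types

```text
Log types:
├── Audit log (audit)
│   ├── Connection events
│   ├── Query events
│   └── Permission changes
├── Slow-query log (slowquery)
│   └── Slow-query details
├── Error log (error)
│   └── Error messages
└── Engine log (engine)
    └── Engine events
```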

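3.3 Querying Logs

```bash
# Search the audit log group for events from the last hour
# (--start-time is in milliseconds since the epoch; date -d requires GNU date)
aws logs filter-log-events \
  --log-group-name /aws/neptune/my-neptune-cluster/audit \
  --start-time $(date -d '1 hour ago' +%s)000 \
  --filter-pattern "[timestamp, level, message]"
```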

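3.4 Log Analysis

```python
# Analyze logs with Python: print audit events from the last hour that contain "ERROR"
from datetime import datetime, timedelta

import boto3

logs = boto3.client('logs')

response = logs.filter_log_events(
    logGroupName='/aws/neptune/my-neptune-cluster/audit',
    # startTime is in milliseconds since the epoch
    startTime=int((datetime.now() - timedelta(hours=1)).timestamp() * 1000),
    filterPattern='ERROR'
)

for event in response['events']:
    print(event['message'])
```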
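
A single `filter_log_events` call returns only one page of results, so a busy log group needs pagination. The sketch below uses the boto3 paginator for that operation and adds a hypothetical `count_by_keyword` helper for a simple tally. The keyword list is an assumption for illustration and does not reflect a documented Neptune log format.

```python
from collections import Counter


def count_by_keyword(events, keywords=("ERROR", "WARN")):
    """Count log events whose message contains each keyword."""
    counts = Counter()
    for event in events:
        message = event.get("message", "")
        for keyword in keywords:
            if keyword in message:
                counts[keyword] += 1
    return counts


def iter_log_events(log_group, start_ms, pattern=""):
    """Yield every matching event, following pagination (needs boto3)."""
    import boto3  # lazy import so count_by_keyword works without AWS access

    paginator = boto3.client("logs").get_paginator("filter_log_events")
    pages = paginator.paginate(
        logGroupName=log_group, startTime=start_ms, filterPattern=pattern
    )
    for page in pages:
        yield from page["events"]


if __name__ == "__main__":
    # Hypothetical messages, not real Neptune audit output.
    sample = [{"message": "ERROR timeout"}, {"message": "ok"}, {"message": "WARN slow"}]
    print(count_by_keyword(sample))
```

Because `iter_log_events` is a generator, its output can be passed straight to `count_by_keyword` without holding every page in memory.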

4. Alarm Configuration

4.1 Creating Alarms

```bash
# CPU alarm: average utilization above 80% for 3 consecutive 1-minute periods
aws cloudwatch put-metric-alarm \
  --alarm-name neptune-high-cpu \
  --alarm-description "CPU usage over 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/Neptune \
  --dimensions Name=DBInstanceIdentifier,Value=my-neptune-primary \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --period 60 \
  --statistic Average \
  --alarm-actions arn:aws:sns:region:account:my-alerts

# Low-memory alarm: average FreeableMemory below 1 GB (threshold is in bytes)
aws cloudwatch put-metric-alarm \
  --alarm-name neptune-low-memory \
  --alarm-description "Freeable memory under 1GB" \
  --metric-name FreeableMemory \
  --namespace AWS/Neptune \
  --dimensions Name=DBInstanceIdentifier,Value=my-neptune-primary \
  --threshold 1000000000 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 3 \
  --period 60 \
  --statistic Average \
  --alarm-actions arn:aws:sns:region:account:my-alerts
```
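
Both alarms above use `--evaluation-periods 3` with `--period 60`, so a single spike does not trigger them: three consecutive one-minute datapoints must breach the threshold. The following is a simplified model of that rule, not CloudWatch's actual evaluator. The function name `alarm_states` is hypothetical, and missing data and the INSUFFICIENT_DATA state are not modeled.

```python
import operator


def alarm_states(values, threshold, evaluation_periods, compare=operator.gt):
    """Return 'ALARM' or 'OK' after each datapoint.

    Simplified model: the state is ALARM only when the most recent
    `evaluation_periods` datapoints all breach the threshold.
    """
    states = []
    for i in range(len(values)):
        window = values[max(0, i - evaluation_periods + 1): i + 1]
        breaching = len(window) == evaluation_periods and all(
            compare(v, threshold) for v in window
        )
        states.append("ALARM" if breaching else "OK")
    return states


if __name__ == "__main__":
    # Hypothetical per-minute CPU averages (percent).
    cpu = [70, 85, 90, 75, 82, 88, 91]
    print(alarm_states(cpu, threshold=80, evaluation_periods=3))
    # ['OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'ALARM']
```

For the low-memory alarm, passing `compare=operator.lt` models `LessThanThreshold` in the same way.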

4.2 Query Latency Alarm

```bash
# High-latency alarm: more than 100 GremlinHttp500s responses per minute for 2 periods.
# GremlinHttp500s counts HTTP 5xx responses, so it is only an indirect signal of slow queries.
aws cloudwatch put-metric-alarm \
  --alarm-name neptune-high-latency \
  --alarm-description "High query latency" \
  --metric-name GremlinHttp500s \
  --namespace AWS/Neptune \
  --dimensions Name=DBInstanceIdentifier,Value=my-neptune-primary \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --period 60 \
  --statistic Sum \
  --alarm-actions arn:aws:sns:region:account:my-alerts
```

4.3 Error Rate Alarm

```bash
# Error alarm: more than 10 Gremlin errors per minute for 2 consecutive periods
# (this is an absolute count, not a percentage of requests)
aws cloudwatch put-metric-alarm \
  --alarm-name neptune-high-errors \
  --alarm-description "High error rate" \
  --metric-name GremlinErrors \
  --namespace AWS/Neptune \
  --dimensions Name=DBInstanceIdentifier,Value=my-neptune-primary \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --period 60 \
  --statistic Sum \
  --alarm-actions arn:aws:sns:region:account:my-alerts
```

5. Dashboard Configuration

5.1 Creating a Dashboard

```bash
# Create a CloudWatch dashboard with two widgets: instance metrics and query performance
aws cloudwatch put-dashboard \
  --dashboard-name Neptune-Monitoring \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "x": 0,
        "y": 0,
        "width": 12,
        "height": 6,
        "properties": {
          "metrics": [
            ["AWS/Neptune", "CPUUtilization", "DBInstanceIdentifier", "my-neptune-primary"],
            [".", "FreeableMemory", ".", "."]
          ],
          "period": 60,
          "stat": "Average",
          "region": "us-east-1",
          "title": "Instance Metrics"
        }
      },
      {
        "type": "metric",
        "x": 0,
        "y": 6,
        "width": 12,
        "height": 6,
        "properties": {
          "metrics": [
            ["AWS/Neptune", "GremlinRequestsPerSec", "DBInstanceIdentifier", "my-neptune-primary"],
            [".", "GremlinHttp100s", ".", "."],
            [".", "GremlinHttp200s", ".", "."],
            [".", "GremlinHttp400s", ".", "."]
          ],
          "period": 60,
          "stat": "Sum",
          "region": "us-east-1",
          "title": "Query Performance"
        }
      }
    ]
  }'
```
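
Hand-written dashboard JSON becomes hard to maintain as widgets are added. As a sketch, the dashboard body can be generated instead. The helpers `metric_widget` and `dashboard_body` below are hypothetical; the widget layout, region, and metric names mirror the example above, and every metric is written out in full rather than with the "." shorthand.

```python
import json


def metric_widget(title, metric_names, instance_id, y,
                  stat="Average", region="us-east-1"):
    """Build one metric widget in the CloudWatch dashboard body format."""
    metrics = [
        ["AWS/Neptune", name, "DBInstanceIdentifier", instance_id]
        for name in metric_names
    ]
    return {
        "type": "metric",
        "x": 0,
        "y": y,
        "width": 12,
        "height": 6,
        "properties": {
            "metrics": metrics,
            "period": 60,
            "stat": stat,
            "region": region,
            "title": title,
        },
    }


def dashboard_body(instance_id):
    """Return the JSON string expected by put-dashboard's dashboard body."""
    widgets = [
        metric_widget("Instance Metrics",
                      ["CPUUtilization", "FreeableMemory"], instance_id, y=0),
        metric_widget("Query Performance",
                      ["GremlinRequestsPerSec"], instance_id, y=6, stat="Sum"),
    ]
    return json.dumps({"widgets": widgets}, indent=2)


if __name__ == "__main__":
    print(dashboard_body("my-neptune-primary"))
```

The resulting string can be passed to `aws cloudwatch put-dashboard --dashboard-body` or to boto3's `put_dashboard(DashboardName=..., DashboardBody=...)`.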

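5.2 Dashboard Components

```text
Dashboard components:
├── CPU utilization chart
├── Memory utilization chart
├── Query latency chart
├── Error rate chart
├── Connection count chart
├── Storage usage chart
└── Alarm status
```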

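6. Event Notifications

6.1 SNS Notifications

```bash
# Create an SNS topic
aws sns create-topic --name neptune-alerts

# Subscribe an email address (the recipient must confirm the subscription)
aws sns subscribe \
  --topic-arn arn:aws:sns:region:account:neptune-alerts \
  --protocol email \
  --notification-endpoint admin@example.com
```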

6.2 EventBridge Rules

```bash
# Create an event rule that matches Neptune events
aws events put-rule \
  --name neptune-events \
  --event-pattern '{
    "source": ["aws.neptune"],
    "detail-type": ["Neptune Event"]
  }'

# Add the SNS topic as the rule's target
aws events put-targets \
  --rule neptune-events \
  --targets '[
    {
      "Id": "1",
      "Arn": "arn:aws:sns:region:account:neptune-alerts"
    }
  ]'
```
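
For EventBridge to deliver to the SNS topic, the topic's resource policy has to allow the `events.amazonaws.com` service principal to call `sns:Publish`; a topic created from the CLI does not include that statement. The sketch below builds such a policy. The helper names and the example ARN are hypothetical, and `apply_policy` overwrites any existing topic policy, so in a real setup the statement should be merged with whatever is already there.

```python
import json


def eventbridge_publish_policy(topic_arn):
    """Return a topic policy (JSON string) letting EventBridge publish."""
    statement = {
        "Sid": "AllowEventBridgePublish",
        "Effect": "Allow",
        "Principal": {"Service": "events.amazonaws.com"},
        "Action": "sns:Publish",
        "Resource": topic_arn,
    }
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]})


def apply_policy(topic_arn):
    """Replace the topic's policy (requires boto3 and AWS credentials)."""
    import boto3  # lazy import so the builder can be tested offline

    boto3.client("sns").set_topic_attributes(
        TopicArn=topic_arn,
        AttributeName="Policy",
        AttributeValue=eventbridge_publish_policy(topic_arn),
    )


if __name__ == "__main__":
    # Hypothetical account ID and region.
    print(eventbridge_publish_policy(
        "arn:aws:sns:us-east-1:123456789012:neptune-alerts"
    ))
```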

7. Performance Monitoring

7.1 Query Performance Monitoring

```gremlin
// Analyze a query with profile()
g.V().hasLabel('person').out('knows').profile()

// Retrieve the profiling statistics
g.V().hasLabel('person').out('knows').profile().next()
```

7.2 Connection Monitoring

```bash
# View the connection count over the last hour (date -d requires GNU date)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Neptune \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=my-neptune-primary \
  --start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Sum
```

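8. Best Practices

8.1 Monitoring Strategy

```text
Monitoring strategy recommendations:
├── Monitor key metrics
├── Set reasonable thresholds
├── Configure multi-level alarms
├── Review alarms regularly
└── Keep alarms effective
```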
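
One way to read "multi-level alarms" is a warning and a critical alarm on the same metric, each routed to a different SNS topic. The sketch below builds the keyword arguments for two such `put_metric_alarm` calls. The helper name, the 70%/90% thresholds, and the topic ARNs are illustrative assumptions, not recommended values.

```python
def tiered_cpu_alarms(instance_id, warning_topic, critical_topic,
                      warning=70, critical=90):
    """Return put_metric_alarm kwargs for a warning and a critical CPU alarm."""
    if warning >= critical:
        raise ValueError("warning threshold must be below critical threshold")

    def alarm(level, threshold, topic):
        return {
            "AlarmName": f"neptune-cpu-{level}-{instance_id}",
            "Namespace": "AWS/Neptune",
            "MetricName": "CPUUtilization",
            "Dimensions": [
                {"Name": "DBInstanceIdentifier", "Value": instance_id}
            ],
            "Statistic": "Average",
            "Period": 60,
            "EvaluationPeriods": 3,
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
            "AlarmActions": [topic],
        }

    return [
        alarm("warning", warning, warning_topic),
        alarm("critical", critical, critical_topic),
    ]


if __name__ == "__main__":
    # Hypothetical topic ARNs.
    for kwargs in tiered_cpu_alarms(
        "my-neptune-primary",
        "arn:aws:sns:us-east-1:123456789012:neptune-warnings",
        "arn:aws:sns:us-east-1:123456789012:neptune-alerts",
    ):
        print(kwargs["AlarmName"], kwargs["Threshold"])
```

Each dictionary can be passed as `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`.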

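8.2 Alerting Strategy

```text
Alerting strategy recommendations:
├── Send critical alerts immediately
├── Batch warning-level alerts
├── Avoid alert fatigue
├── Keep a history of alerts
└── Tune thresholds regularly
```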

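9. Summary

Key points for monitoring and alerting:

| Item | Service |
| --- | --- |
| Metrics monitoring | CloudWatch Metrics |
| Log management | CloudWatch Logs |
| Alarm configuration | CloudWatch Alarms |
| Dashboards | CloudWatch Dashboards |
| Notifications | SNS, EventBridge |

Best practices:

- Monitor key metrics
- Configure reasonable alarms
- Enable audit logs
- Create monitoring dashboards
- Review alarms regularly

Congratulations on completing your Amazon Neptune learning journey!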