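Monitoring and Operations

I. Monitoring Overview

1.1 Monitoring Dimensions

text

InfluxDB monitoring dimensions:

Service status
├── Process status
├── Port listening
├── HTTP response
└── Service availability

Resource usage
├── CPU utilization
├── Memory usage
├── Disk I/O
└── Network traffic

Performance metrics
├── Write rate
├── Query latency
├── Request success rate
└── Concurrent connections

Storage metrics
├── Data size
├── Series count
├── Compression ratio
└── Disk space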

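1.2 Monitoring Tools

text

Choosing monitoring tools:

Built-in monitoring
├── _internal database (InfluxDB 1.x only)
├── /metrics endpoint
└── /health endpoint

External tools
├── Prometheus
├── Grafana
├── Telegraf
└── Custom scripts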

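II. Health Checks

2.1 HTTP Health Check

bash

# Basic health check (response headers only)
curl -I http://localhost:8086/health

# Expected response:
# HTTP/1.1 200 OK
# Content-Type: application/json

# Detailed health check (returns a JSON body)
curl http://localhost:8086/health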
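
The JSON body returned by the detailed check can be inspected in a script rather than by eye. The sketch below assumes the body carries a top-level "status" field (for example "pass"), as InfluxDB 2.x returns; the function name health_status is hypothetical, and a JSON-aware tool such as jq would be more robust than this sed pattern.

```bash
# Extract the "status" field from a /health JSON body read on stdin.
# Prints the status value, or "unknown" when no such field is found.
# Assumption: the field appears once per line; nested "status" fields
# on the same line are not distinguished.
health_status() {
    status=$(sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1)
    echo "${status:-unknown}"
}
```

Typical use would be `curl -s http://localhost:8086/health | health_status`, comparing the printed value against "pass".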

2.2 CLI Health Check

bash

# Ping the default host with the influx CLI
influx ping

# Ping a specific host
influx ping --host http://localhost:8086

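2.3 Automated Health Check Script

bash

#!/bin/bash
# health_check.sh

INFLUX_URL="http://localhost:8086"
ALERT_WEBHOOK="https://hooks.slack.com/services/xxx"

# Health check; --max-time keeps the script from hanging on an unresponsive server.
# curl reports 000 when no HTTP response was received at all.
response=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$INFLUX_URL/health")

if [ "$response" != "200" ]; then
    # Send an alert
    curl -s -X POST "$ALERT_WEBHOOK" \
        --header "Content-Type: application/json" \
        --data "{\"text\":\"InfluxDB health check failed: HTTP $response\"}"

    # Try to restart the service
    systemctl restart influxdb

    exit 1
fi

echo "Health check passed"
exit 0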
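
The script above restarts the service after a single failed probe, so a brief network hiccup can trigger an unnecessary restart. One possible refinement is to retry the probe a few times first. The helper below is a minimal sketch; the name retry and its argument order are hypothetical and not part of InfluxDB or curl.

```bash
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as one attempt succeeds, 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep only when another attempt is still to come.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

In the health check this could wrap the probe, for example `retry 3 5 curl -sf --max-time 10 -o /dev/null "$INFLUX_URL/health"`, with the alert and restart running only when the wrapped command still fails.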

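III. Built-in Monitoring

3.1 The _internal Database

flux

// The _internal database exists only in InfluxDB 1.x. From Flux it is
// addressed as "database/retention-policy", here "_internal/monitor".

// Write statistics
from(bucket: "_internal/monitor")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "write")

// Query statistics (1.x records these under the "queryExecutor" measurement)
from(bucket: "_internal/monitor")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "queryExecutor")

// HTTP statistics
from(bucket: "_internal/monitor")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "httpd")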

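3.2 The /metrics Endpoint

bash

# Fetch metrics in Prometheus exposition format
curl http://localhost:8086/metrics

# Commonly used metrics (names differ between InfluxDB versions;
# confirm them against the output of the command above)
# influxdb_write_points_total - total points written
# influxdb_httpd_requests_total - total requests
# influxdb_storage_series_count - series count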
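
To pull a single value out of the /metrics output without a Prometheus server, the text format can be filtered directly. The following is a rough sketch: metric_sum is a hypothetical helper, it assumes each sample line has the form `name{labels} value` with no trailing timestamp, and it sums the samples of all label combinations for the given metric name.

```bash
# Read Prometheus text-format metrics on stdin and print the sum of all
# samples whose metric name equals $1. Exits with status 1 if the metric
# is not present in the input.
metric_sum() {
    awk -v name="$1" '
        /^#/ { next }                      # skip HELP and TYPE comment lines
        {
            n = $1
            sub(/[{].*/, "", n)            # drop the label part, if any
            if (n == name) { sum += $NF; found = 1 }
        }
        END { if (found) print sum; else exit 1 }
    '
}
```

For example, `curl -s http://localhost:8086/metrics | metric_sum process_open_fds` would print the current number of open file descriptors.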

3.3 Enabling Monitoring

toml

# Configuration file (InfluxDB 1.x, influxdb.conf)
[monitor]
  store-enabled = true
  store-database = "_internal"
  store-interval = "10s"

IV. Prometheus Monitoring

4.1 Configuring Prometheus

yaml

# prometheus.yml
scrape_configs:
  - job_name: 'influxdb'
    static_configs:
      - targets: ['localhost:8086']
    metrics_path: '/metrics'

4.2 Key Metrics

text

Key metrics to monitor (verify the exact names against your own /metrics output):

Write metrics
├── influxdb_write_points_total
├── influxdb_write_errors_total
└── influxdb_write_dropped_total

Query metrics
├── influxdb_httpd_requests_total
├── influxdb_httpd_query_requests_total
└── influxdb_httpd_query_duration_seconds

Storage metrics
├── influxdb_storage_series_count
├── influxdb_storage_tsm_files_count
└── influxdb_storage_disk_usage_bytes

Resource metrics
├── process_cpu_seconds_total
├── process_resident_memory_bytes
└── process_open_fds
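
Metrics ending in _total are cumulative counters, so a rate such as points written per second has to be derived from two samples taken some time apart. The sketch below shows that arithmetic; rate_per_sec is a hypothetical helper, and it treats a decreasing counter (for example after a restart) as an error instead of guessing.

```bash
# Compute the per-second rate between two counter samples.
#   $1 = earlier counter value, $2 = later counter value, $3 = seconds between them
# Exits with status 1 when the interval is not positive or the counter went backwards.
rate_per_sec() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
        if (s <= 0 || b < a) exit 1
        printf "%.2f\n", (b - a) / s
    }'
}
```

A counter that grows from 1000 to 4000 over 60 seconds gives `rate_per_sec 1000 4000 60`, which prints 50.00. Prometheus performs the same kind of calculation with its rate() function.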

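4.3 Grafana Dashboard

text

Grafana dashboard setup:

Data source
├── Type: Prometheus
├── URL: http://prometheus:9090
└── Access mode: Server

Dashboard panels
├── Service status
│   └── Health check, uptime
├── Write performance
│   └── Write rate, error rate
├── Query performance
│   └── Query latency, QPS
├── Storage usage
│   └── Data size, series count
└── Resource usage
    └── CPU, memory, disk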

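V. Log Management

5.1 Log Configuration

toml

# Configuration file
# Check these keys against the configuration reference for your version;
# the file path and rotation settings may need to be handled by
# journald or logrotate instead.
[logging]
  level = "info"           # log level
  format = "json"          # log format
  path = "/var/log/influxdb/influxd.log"
  max-size = "100MB"
  max-age = "7d"
  max-backups = 5
  compress = true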

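5.2 Log Levels

text

Log levels:

error
├── Error messages
├── Service failures
└── Needs immediate attention

warn
├── Warning messages
├── Potential problems
└── Worth watching

info
├── Runtime information
├── Normal operations
└── Default level

debug
├── Debug information
├── Verbose logging
└── For development and debugging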

5.3 Viewing Logs

bash

# Follow the service log
journalctl -u influxdb -f

# Follow the log file
tail -f /var/log/influxdb/influxd.log

# Docker logs
docker logs -f influxdb

# Filter for error entries
journalctl -u influxdb | grep -i error

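5.4 Log Analysis

bash

# Field names (.msg, .level, .duration) depend on the log format of your version

# Show request log entries
jq 'select(.msg == "request")' /var/log/influxdb/influxd.log

# Count log entries by level
jq -r '.level' /var/log/influxdb/influxd.log | sort | uniq -c

# Find slow queries (duration above 1 second, assuming nanoseconds)
jq 'select(.duration > 1000000000)' /var/log/influxdb/influxd.log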
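
On hosts where jq is not installed, the per-level count can be approximated with standard text tools. This is a sketch under the same assumption as the commands above, namely one JSON object per line with the level stored under a "level" key; the helper name count_levels is hypothetical, and the key can be overridden if your logs use a different one.

```bash
# Count JSON log lines per level, reading the log on stdin.
#   $1 = name of the level key (optional, defaults to "level")
# Prints one "level=count" line per level, sorted by level name.
count_levels() {
    key=${1:-level}
    grep -o "\"$key\":\"[^\"]*\"" \
        | cut -d'"' -f4 \
        | sort \
        | uniq -c \
        | awk '{ print $2 "=" $1 }'
}
```

Usage would look like `count_levels < /var/log/influxdb/influxd.log`, or `count_levels lvl < ...` for logs that store the level under another key.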

VI. Alerting

6.1 Alert Rules

yaml

# Prometheus alert rules
groups:
  - name: influxdb
    rules:
      - alert: InfluxDBDown
        expr: up{job="influxdb"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "InfluxDB service is unavailable"

      # 8589934592 bytes = 8 GiB
      - alert: InfluxDBHighMemory
        expr: process_resident_memory_bytes{job="influxdb"} > 8589934592
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "InfluxDB memory usage is too high"

      # More than 10 write errors per second, averaged over 5 minutes
      - alert: InfluxDBWriteErrors
        expr: rate(influxdb_write_errors_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "InfluxDB write error rate is too high"
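
The memory threshold in the rule above is a raw byte count, which is easy to mistype when adapting it to a different host size. The sketch below derives it from a GiB figure and prints the matching expression; gib_to_bytes and memory_alert_expr are hypothetical helper names, and the output is meant to be pasted into the rule file by hand.

```bash
# Convert a whole number of GiB to bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print a PromQL expression that fires when resident memory of the
# given scrape job ($1) exceeds the given number of GiB ($2).
memory_alert_expr() {
    printf 'process_resident_memory_bytes{job="%s"} > %s\n' "$1" "$(gib_to_bytes "$2")"
}
```

`memory_alert_expr influxdb 8` reproduces the expression used in the InfluxDBHighMemory rule.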

6.2 Alert Notifications

yaml

# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  receiver: 'team-notifications'
  
receivers:
  - name: 'team-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

VII. Operations Commands

7.1 Service Management

bash

# Start the service
systemctl start influxdb

# Stop the service
systemctl stop influxdb

# Restart the service
systemctl restart influxdb

# Show status
systemctl status influxdb

# Enable start on boot
systemctl enable influxdb

7.2 Data Management

bash

# List buckets (shows retention settings, not on-disk size)
influx bucket list

# Check series cardinality
influx query 'import "influxdata/influxdb"
influxdb.cardinality(bucket: "my-bucket", start: -30d)'

# Delete old data within a time range
influx delete \
    --bucket my-bucket \
    --start 2024-01-01T00:00:00Z \
    --stop 2024-01-31T00:00:00Z

# Delete only the data matching a predicate
influx delete \
    --bucket my-bucket \
    --predicate '_measurement="cpu" AND host="server01"' \
    --start 2024-01-01T00:00:00Z \
    --stop 2024-01-31T00:00:00Z

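7.3 Performance Diagnostics

bash

# List the available runtime profiles
curl http://localhost:8086/debug/pprof/

# CPU profile (samples for 30 seconds by default)
curl http://localhost:8086/debug/pprof/profile > cpu.prof

# Heap (memory) profile
curl http://localhost:8086/debug/pprof/heap > heap.prof

# Analyze with go tool pprof
go tool pprof cpu.prof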

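VIII. Capacity Planning

8.1 Estimating Storage Capacity

text

Storage estimation formula:

Data size = point size × points per second × retention period (in seconds) × compression ratio

Example:
├── Point size: about 100 bytes
├── Points per second: 10,000
├── Retention period: 30 days (86400 × 30 seconds)
├── Compression ratio: 0.1 (compressed size / raw size)
└── Data size = 100 × 10000 × 86400 × 30 × 0.1 ≈ 259 GB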
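
The formula above is easy to re-run for other workloads with a small helper. This is a minimal sketch of the same arithmetic; estimate_storage_bytes is a hypothetical name, retention is given in days and converted to seconds inside the function, and the result is in bytes (divide by 10^9 for decimal GB).

```bash
# Estimate on-disk size in bytes.
#   $1 = bytes per point, $2 = points per second,
#   $3 = retention in days, $4 = compression ratio (compressed / raw)
estimate_storage_bytes() {
    awk -v size="$1" -v pps="$2" -v days="$3" -v ratio="$4" 'BEGIN {
        printf "%.0f\n", size * pps * days * 86400 * ratio
    }'
}
```

With the example figures, `estimate_storage_bytes 100 10000 30 0.1` prints 259200000000, which matches the roughly 259 GB shown above.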

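8.2 Resource Planning

text

Resource planning recommendations:

Write rate
├── < 100,000 points/s: 4 CPU cores, 8 GB RAM
├── 100,000 to 500,000 points/s: 8 CPU cores, 16 GB RAM
└── > 500,000 points/s: 16 CPU cores, 32 GB RAM

Storage
├── Reserve 30% headroom
├── Prefer SSD storage
└── Account for backup space

Network bandwidth
├── Write bandwidth: data size × 1.5
├── Query bandwidth: depends on query frequency
└── Reserve 50% headroom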
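
The write-rate tiers above can be encoded as a lookup for use in provisioning scripts. The sketch below simply mirrors the table; recommend_tier is a hypothetical helper, and rates of exactly 100,000 or 500,000 points per second are assigned to the middle tier, which is an assumption since the table does not spell out the boundaries.

```bash
# Print the suggested CPU and memory sizing for a write rate
# given in points per second ($1, a whole number).
recommend_tier() {
    pps=$1
    if [ "$pps" -lt 100000 ]; then
        echo "4 cores, 8 GB RAM"
    elif [ "$pps" -le 500000 ]; then
        echo "8 cores, 16 GB RAM"
    else
        echo "16 cores, 32 GB RAM"
    fi
}
```

For instance, `recommend_tier 250000` prints "8 cores, 16 GB RAM".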

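IX. Troubleshooting

9.1 Common Problems

text

Common problems and fixes:

Service fails to start
├── Check whether the port is already in use
├── Check file permissions
├── Check the configuration file
└── Review the logs

Writes fail
├── Check token permissions
├── Check that the bucket exists
├── Check the data format
└── Check disk space

Slow queries
├── Optimize the query
├── Check index usage
├── Add resources
└── Check concurrency

High memory usage
├── Check series cardinality
├── Tune cache settings
├── Add memory
└── Restart the service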
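
For failed writes, the HTTP status code of the write request usually narrows down which item of the checklist to start with. The mapping below is an assumed, simplified reading of common HTTP semantics applied to the InfluxDB 2.x write API (204 for success, 400 for malformed data, 401 or 403 for authorization problems, 404 for a missing bucket or organization); diagnose_write_status is a hypothetical helper, and the API reference for your version should be treated as authoritative.

```bash
# Map the HTTP status code of a write request ($1) to the first
# thing to check from the "Writes fail" list.
diagnose_write_status() {
    case "$1" in
        204)     echo "ok: write accepted" ;;
        400)     echo "check the data format" ;;
        401|403) echo "check token permissions" ;;
        404)     echo "check that the bucket exists" ;;
        5??)     echo "check disk space and server logs" ;;
        *)       echo "unexpected status: review the logs" ;;
    esac
}
```

The status code itself can be captured with curl's `-w "%{http_code}"` option, as in the health check script.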

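9.2 Troubleshooting Procedure

bash

#!/bin/bash
# troubleshoot.sh

echo "=== Service status ==="
systemctl status influxdb

echo "=== Listening port ==="
netstat -tlnp | grep 8086

echo "=== Disk space ==="
df -h /var/lib/influxdb2

echo "=== Memory usage ==="
free -h

echo "=== Recent errors ==="
journalctl -u influxdb -n 20 --no-pager | grep -i error

echo "=== Established connections ==="
# Count only established connections on port 8086 (excludes the LISTEN socket)
netstat -an | grep ':8086' | grep -c ESTABLISHED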
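
The disk-space step prints a table that still has to be read by a person. A script can turn it into a pass or fail check, for example against the 30% headroom suggested in the capacity planning section (that is, at most 70% used). The sketch below assumes the POSIX output layout of `df -P`, where the fifth column of the second line is the used percentage; both function names are hypothetical.

```bash
# Read `df -P <path>` output on stdin and print the used percentage
# of the first listed filesystem as a bare number (no % sign).
disk_used_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when the used percentage ($1) is at or below the limit ($2).
disk_within_limit() {
    [ "$1" -le "$2" ]
}
```

A possible use in the script: `pct=$(df -P /var/lib/influxdb2 | disk_used_pct)` followed by `disk_within_limit "$pct" 70 || echo "Disk usage is ${pct}%, above the 70% limit"`.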

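X. Summary

Key points for monitoring and operations:

Health checks: check service status regularly
Metrics monitoring: collect metrics with Prometheus
Log management: configure an appropriate log level
Alerting: set alerts on key metrics
Troubleshooting: establish a troubleshooting procedure

Next, let's move on to performance tuning!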